Abstract: Much effort has been expended on infrastructures, technologies, and standards for digital libraries of learning resources. Key objectives of these initiatives are to improve teacher and learner access to high-quality learning resources and to increase their use in order to improve education. In this article, we take a broader view by proposing a framework that incorporates inputs and process variables affecting these desired outcomes. Inputs include variables such as audience characteristics. Outcomes include increased teacher and student use of learning resources from digital libraries. In particular, two key variables are examined in this framework. The first is a professional development program aimed at educators on the topic of using educational digital libraries. The second is a simple end-user authoring service, called the Instructional Architect (IA). The IA helps users, particularly teachers, discover, select, sequence, annotate, and reuse learning resources stored in digital libraries.
To assess the viability of this framework, we designed a series of studies to test the assumptions that link program inputs, processes, and outputs. These studies involved 100 educators who participated in professional development workshops focused on digital libraries and use of the IA. The studies used mixed methods, including electronic surveys, participant observations, interviews, and usage data (including server log file and artifact analyses). The usage data can be contrasted with the activities of approximately 50 'organic' users, that is, users who used learning resources and the IA without the benefit of formal instruction.
These multiple and complementary levels of analysis reveal that, despite teachers reporting great value in learning resources and educational digital libraries, significant and lasting impact on teaching practice remains difficult to obtain.
Abstract: This paper explores two ways to help students locate the most relevant resources in educational digital libraries. One is more comprehensive access to educational resources through several means of information access, including browsing and information visualization. The other is personalized information access through social navigation support. The paper presents the details of the Knowledge Sea III system for comprehensive personalized access to educational resources and reports the results of a classroom study. The study delivered a convincing argument for the importance of providing several means of information access, showing that only about 10% of all resource accesses were made through the traditional search interface. We also collected encouraging evidence in favor of the social navigation support.
Abstract: Previous research on using digital libraries in science classrooms indicated that middle school students tend to passively find answers rather than actively make sense of information they find in digital libraries. In response to this challenge, we designed a scaffolded software tool, the Digital IdeaKeeper, to support middle school students in making sense of digital library resources during online inquiry. This paper describes preliminary results from a study of how middle school students use different IdeaKeeper features. Participants included four eighth-grade science classes taught by two teachers. Multiple sources of data were collected, including video recordings of students' computer activities and conversations, students' artifacts, log files, and students' final writing. Initial data analysis indicates that IdeaKeeper can help online learners engage in the sense-making process during online inquiry.
Abstract: This paper describes G-Portal, a geospatial digital library of geographical assets that provides an interactive platform to engage students in active manipulation and analysis of information resources and in collaborative learning activities. Using a G-Portal application in which students conducted a field study of an environmental problem (beach erosion and sea level rise), we describe a pilot study that evaluated usefulness and usability issues in supporting geography learning and, in turn, teaching.
Abstract: This paper introduces a new framework for building digital library collections and contrasts it with existing systems. It describes a significant new step in the development of a widely-used open-source digital library system, Greenstone, which has evolved over many years. It is supported by a fresh implementation, which forced us to rethink the entire design rather than making incremental improvements. The redesign capitalizes on the best ideas from the existing system, which have been refined and developed to open new avenues through which digital librarians can tailor their collections. We demonstrate its flexibility by showing how digital library collections can be extended and altered to satisfy new requirements.
Abstract: As an increasing number of digital library projects embrace the harvesting of item-level descriptive metadata, issues of description granularity and concerns about potential loss of context when harvesting item-level metadata take on greater significance. Collection-level description can provide added context for item-level metadata records harvested from disparate and heterogeneous providers. This paper describes an ongoing experiment using collection-level description in concert with item-level metadata to improve quality of search and discovery across an aggregation of metadata describing resources held by a consortium of large academic research libraries. We present details of approaches implemented so far and preliminary analyses of the potential utility of these approaches. The paper concludes with a brief discussion of related issues and future work plans.
Abstract: Significant barriers deter web page designers and developers from incorporating dynamic content from web services into their page designs. Web services typically require designers to learn service protocols and to have access to and knowledge of dynamic application servers or CGI in order to incorporate dynamic content into their pages. This paper describes a framework for embedding discovery services in distributed interfaces that seeks to simplify this process and eliminate these barriers, making dynamic content available to a wider audience and increasing its potential for adoption and use in educational design.
Abstract: The process of authoring document-centric XML documents in humanities disciplines is very different from the approach espoused by standard XML editing software, which takes a data-centric view of XML. Whereas data-centric XML is generated by first describing a tree structure for the encoding and then providing the content for the leaf elements, document-centric encodings start with content which is then marked up. In this paper we describe our approach to authoring document-centric XML documents and the tool, xTagger, originally developed for this purpose within the Electronic Boethius project and later enhanced within the ARCHway project, an interdisciplinary project devoted to the development of methods and software for the preparation of image-based electronic editions of historic manuscripts.
Abstract: This paper describes an ongoing collaborative effort across digital library and scientific communities in the UK to improve access to research data. A prototype demonstrator service supporting the discovery and retrieval of detailed results of crystallography experiments has been deployed within an Open Archives digital library service model. Early challenges include the understanding of requirements in this specialized area of chemistry and reaching consensus on the design of a metadata model and schema. Future plans encompass the exploration of commonality and overlap with other schemas and across disciplines, working with publishers to develop mutually beneficial service models, and investigation of the pedagogical benefits. The potential improved access to experimental data to enrich scholarly communication from the perspective of both research and learning provides the driving force to continue exploring these issues.
Abstract: Advances in both technology and publishing practices continue to increase the quantity of scientific literature that is available electronically. In this paper, we introduce the Information Synthesis process, a new approach that enables scientists to visualize, explore, and resolve the contradictory findings that are inevitable when multiple empirical studies explore the same natural phenomena. Central to the Information Synthesis approach is a cyber-infrastructure that provides a scientist with both secondary information from an article and structured information resources. To demonstrate this approach, we have developed the Multi-User, Information Extraction for Information Synthesis (METIS) System. METIS is an interactive system that automates critical tasks within the Information Synthesis process. We provide two case studies that demonstrate the utility of the Information Synthesis approach.
Abstract: In this paper we describe the methods, goals, and early findings of the research endeavor 'Comparative Interoperability Project' (CIP). The CIP is an extended interdisciplinary collaboration of information and social scientists with the shared goal of understanding the diverse range of interoperability strategies within information infrastructure building activities. We take interoperability strategies to be the simultaneous mobilization of community, organizational, and technical resources to enable data integration. The CIP draws together work with three ongoing collaborative scientific projects (GEON, LTER, Ocean Informatics) that are building information infrastructures for the natural sciences.
Abstract: The Genescene development team has constructed an aggregation interface for automatically-extracted biomedical pathway relations that is intended to help researchers identify and process relevant information from the vast digital library of abstracts found in the National Library of Medicine's PubMed collection. Users view extracted relations at various levels of relational granularity in an interactive and visual node-link interface. Anecdotal feedback reported here suggests that this multi-granular visual paradigm aligns well with various research tasks, helping users find relevant articles and discover new information.
Abstract: While it would seem that digital video libraries should benefit from access mechanisms directed to their visual contents, years of TREC Video Retrieval Evaluation (TRECVID) research have shown that text search against transcript narrative text provides almost all the retrieval capability, even with visually oriented generic topics. A within-subjects study involving 24 novice participants on TRECVID 2004 tasks again confirms this result. The study shows that satisfaction is greater and performance is significantly better on specific and generic information retrieval tasks from news broadcasts when transcripts are available for search. Additional runs with 7 expert users reveal different novice and expert interaction patterns with the video library interface, helping explain the novices' lack of success with image search and visual feature browsing for visual information needs. Analysis of TRECVID visual features well suited for particular generic tasks provides additional insights into the role of automated feature classification for digital image and video libraries.
Abstract: This research assessed the effectiveness of selected interface tools for helping people respond to classic information tasks with webcasts. Webcasts, another form of multimedia, have received little public research attention. Rather than focus on the classic search/browse task of locating an appropriate webcast to view, our work takes place at the level of an individual webcast, assessing interactivity with the contents of a single webcast. The questions guiding our work are: 1) Which tool(s) are the most effective in achieving the best response? 2) What types of tools are needed for optimum response? In this study, 16 participants responded to a standard set of information tasks using ePresence, a webcasting system that handles both live and stored video and provides multiple techniques for accessing content. Using questionnaires, screen capture, and interviews, we evaluated the interaction holistically and found that the tools in place were not as useful as expected.
Abstract: The MIDAS project is developing infrastructure and policies for optimal display of digital information on devices with diverse characteristics. In this paper we present the preliminary results of a study that explored the effects of scaling and color-depth variation in digital photographs on user perceptions of similarity. Our results indicate general trends in user preferences and can serve as guidelines for designing policies and systems that display digital images optimally on various information devices.
Abstract: When users seek specific resources in a digital library, they often use the library catalog to locate them. Such catalog queries are known as known item queries. Because known item queries search for specific resources, it is important to handle them differently from other search types, such as area searches. We study how to identify known item queries in the context of a large academic institution's online public access catalog (OPAC). We also examine how to recognize when a known item query has retrieved the item in question. Our approach combines techniques from machine learning, language modeling, and machine translation evaluation metrics to build a classifier that distinguishes known item queries with an accuracy of 72% and correctly classifies retrieved titles as to whether they are the known item sought with an accuracy of 86%. To our knowledge, this is the first report of such work, which has the potential to streamline the user interface of both OPACs and digital libraries in support of known item searches.
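The matching half of this problem can be illustrated with a BLEU-style unigram-overlap score between a query and a candidate title. This is a minimal sketch: the scoring function and the threshold are illustrative stand-ins, not the paper's trained classifier.

```python
def ngram_overlap(query, title, n=1):
    """Fraction of query n-grams that also occur in the title:
    a simplified, MT-evaluation-style precision for judging whether
    a retrieved title matches a known-item query."""
    def ngrams(text, n):
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    q = ngrams(query, n)
    t = set(ngrams(title, n))
    if not q:
        return 0.0
    return sum(1 for g in q if g in t) / len(q)

def is_known_item(query, title, threshold=0.6):
    # The threshold is illustrative, not a tuned value from the paper.
    return ngram_overlap(query, title) >= threshold
```

In practice such a score would be one feature among several (the paper also draws on language modeling and machine learning), but it conveys why MT evaluation metrics are a natural fit for title matching.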
Abstract: An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.
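The iterative query-selection loop described above can be sketched as a greedy, frequency-based policy: pick the next query as the most frequent keyword seen in the pages downloaded so far. This is a minimal sketch of one general strategy, not the paper's exact policies (which also weigh query cost and expected new coverage); `fetch_results` is a hypothetical stand-in for the site's search form.

```python
from collections import Counter

def select_queries(seed_query, fetch_results, max_queries=100):
    """Iteratively crawl a Hidden Web site: issue a query, download
    the result pages, and choose the next query as the most frequent
    term in downloaded pages that has not been issued yet."""
    term_counts = Counter()
    issued = {seed_query}
    downloaded = set()
    query = seed_query
    for _ in range(max_queries):
        docs = fetch_results(query)          # issue query via the search form
        for doc_id, text in docs:
            if doc_id not in downloaded:     # count terms only in newly seen docs
                downloaded.add(doc_id)
                term_counts.update(text.lower().split())
        # most frequent term not yet issued becomes the next query
        candidates = [t for t, _ in term_counts.most_common() if t not in issued]
        if not candidates:
            break
        query = candidates[0]
        issued.add(query)
    return downloaded, issued
```

The greedy choice works because frequent terms in already-downloaded pages tend to match many still-unseen pages in the same collection, which is the intuition behind iterative policies of this kind.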
Abstract: In this paper we discuss the architecture of a tool designed to help users develop vertical search engines in different domains and different languages. We present the design of the tool and report an evaluation study showing that the system is easier to use than other existing tools.
Abstract: Work at the University of California, Berkeley and the University of Liverpool in the UK is developing an Information Retrieval and Digital Library service system (Cheshire3) that will operate in both single-processor and "Grid" distributed computing environments. This short paper describes the object architecture of the Cheshire3 system and discusses how it can be used for a variety of Digital Library tasks and how it performs in a Grid processing environment.
Abstract: The Digital Anthropology Resources for Teaching (DART) project integrates the content acquisition and cataloging initiatives of a federated digital repository with the development of scholarly publications and the creation of digital tools to facilitate classroom teaching. The project's technical architecture and unique publishing model create a teaching context where students move easily between primary and secondary source material and between authored environments and independent research, and raise specific issues with regard to metadata, object referral, rights, and exporting content. The model also addresses the loss of provenance and catalog information for digital objects embedded in "born-digital" publications. The DART project presents a practical methodology to combine repository and publication that is both exportable and discipline-neutral.
Abstract: The aim of this paper is to report the research results of an ongoing project that deals with the exploitation of a digital archive of drawings and illustrations of historic documents for research and education purposes. Based on the results of a study of user requirements, we designed tools that provide researchers with novel ways of accessing the digital manuscripts and of sharing and transferring knowledge in a collaborative environment. Annotations are proposed for making explicit the results of scientific research on the relationships between images belonging to manuscripts produced over a time span of centuries. For this purpose, a taxonomy of linking annotations is proposed, together with a conceptual schema for representing annotations and for linking them to digital objects.
Abstract: Users of modern digital libraries (DLs) can keep themselves up-to-date by searching and browsing their favorite collections, or more conveniently by resorting to an alerting service. The alerting service notifies its clients about new or changed documents. So far, no sophisticated service has been proposed that covers heterogeneous and distributed collections and is integrated with the digital library software.
This paper analyses the conceptual requirements of this much-sought-after service for digital libraries. We demonstrate that the differing concepts of digital libraries and their underlying technical designs have an extensive influence on (a) the expectations, needs, and interests of users regarding an alerting service, and (b) the technical possibilities for implementing the service.
Our findings show that the range of issues surrounding alerting services for digital libraries, their design, and their use is greater than one might anticipate. We also show that, conversely, the requirements for an alerting service have considerable impact on the concepts of DL design. Our findings should be of interest to librarians as well as system designers. We highlight and discuss the far-reaching implications for the design of, and interaction with, libraries, and discuss the lessons learned from building such a distributed alerting service. We present our prototype implementation as a proof of concept for an alerting service for open DL software.
Abstract: Recommender systems can provide valuable services in a digital library environment, as demonstrated by their commercial success in the book, movie, and music industries. One of the most successful recommendation algorithms is collaborative filtering, which explores the correlations within user-item interactions to infer user interests and preferences. However, the recommendation quality of collaborative filtering approaches is greatly limited by the data sparsity problem. To alleviate this problem we have previously proposed graph-based algorithms that explore transitive user-item associations. In this paper, we extend the idea of analyzing user-item interactions as graphs and employ link prediction approaches proposed in the recent network modeling literature for making collaborative filtering recommendations. We have adapted a wide range of linkage measures for making recommendations. Our preliminary experimental results based on a book recommendation dataset show that some of these measures achieved significantly better performance than standard collaborative filtering algorithms.
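The flavor of such linkage measures can be illustrated by counting length-3 paths (user, item, other user, candidate item) in the bipartite user-item graph, a simple common-neighbor-style measure. This is a sketch of the general idea only; the paper adapts a wider range of measures, and all names here are illustrative.

```python
from collections import defaultdict

def recommend(interactions, user, top_k=3):
    """Score unseen items for `user` by counting 3-hop paths
    user -> item -> other user -> candidate item in the bipartite
    interaction graph, then return the top-scoring items."""
    items_of = defaultdict(set)
    users_of = defaultdict(set)
    for u, i in interactions:
        items_of[u].add(i)
        users_of[i].add(u)
    scores = defaultdict(int)
    for j in items_of[user]:                 # items the user already has
        for v in users_of[j]:                # other users of those items
            if v == user:
                continue
            for i in items_of[v]:            # their other items
                if i not in items_of[user]:
                    scores[i] += 1           # one more 3-hop path to item i
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_k]
```

Because the score aggregates transitive associations rather than requiring direct rating overlap, it remains informative even when the interaction matrix is sparse, which is the motivation the abstract gives.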
Abstract: Several researchers have developed tools for classifying or clustering Web search results into different topic areas, helping users quickly identify relevant results in their area of interest. These tools have mainly focused on topical categorization, such as sports, movies, travel, and computers. This study is in the area of sentiment classification: automatically classifying online review documents according to the overall sentiment expressed in them. A challenging aspect is that, while topics are often identifiable by keywords alone, sentiment can be expressed in a more subtle manner, which means sentiment classification requires more natural language understanding than usual topic-based classification. A prototype system has been developed to help users quickly focus on recommended (or non-recommended) information by automatically classifying Web search results into four categories (positive, negative, neutral, and non-review documents), using a classifier based on a supervised machine learning algorithm, the Support Vector Machine (SVM).
Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Qinghua Zheng
Abstract: In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that do not belong to any specific genre, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods were proposed mainly for title extraction from research papers, and it was not clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint files. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize format information, such as font size, as features in the models. It turns out that the use of format information is the key to successful extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word were 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint were 0.875 and 0.895 respectively. Other important new findings in this work include that we can train models in one domain and apply them to another domain and, more surprisingly, that we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
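The role of format features can be illustrated with a toy scorer over lines annotated with layout attributes. The weights below are hand-set for illustration, whereas the paper learns its models from annotated Word and PowerPoint documents; the field names are hypothetical.

```python
def extract_title(lines):
    """Pick the most title-like line using format cues of the kind
    the paper reports as decisive: font size, boldness, and position
    near the top of the document."""
    def score(line, index):
        s = 0.0
        s += line["font_size"]            # larger text is more title-like
        s += 5.0 if line.get("bold") else 0.0
        s += max(0.0, 3.0 - index)        # earlier lines score higher
        return s
    best = max(enumerate(lines), key=lambda p: score(p[1], p[0]))
    return best[1]["text"]
```

The key point the abstract makes is that such layout signals transfer across genres, domains, and even languages far better than lexical features do, which a learned version of this scorer exploits.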
Abstract: In this paper, we introduce the HiBO bookmark management system. HiBO aims at enhancing populated personal repositories (i.e., bookmark collections) by automatically organizing their contents into topics through the use of a built-in subject hierarchy. HiBO offers personalized services, such as the meaningful grouping and ordering of bookmarks within the hierarchy's topics in terms of the bookmarks' conceptual similarity to each other. HiBO also provides a framework that allows users to customize and assist the categorization process.
Abstract: Automatic text classification is an important operational problem in digital library practice. Most text classification efforts so far have concentrated on developing centralized solutions. However, centralized classification approaches are often limited by constraints on knowledge and computing resources. In addition, centralized approaches are more vulnerable to attacks or system failures and less robust in dealing with them. We present a decentralized approach and system implementation (named MACCI) for text classification using a multi-agent framework. Experiments were conducted to compare our multi-agent approach with a centralized approach. The results show that multi-agent classification can achieve promising classification results while maintaining its other advantages.
Abstract: The temporal elements of users' information requirements are a continually confounding aspect of digital library design. No sooner have users' needs been identified and supported than they change. This paper evaluates the changing information requirements of users through their 'information journey' in two different domains (health and academia). In-depth analysis of findings from interviews, focus groups, and observations of 150 users has identified three stages in this journey: information initiation, facilitation (or gathering), and interpretation. The study shows that, although digital libraries are supporting aspects of users' information facilitation, there are still requirements for them to better support users' overall information work in context. Users are poorly supported in the initiation phase, as they recognize their information needs, especially with regard to resource awareness; in this context, interactive press alerts are discussed. Some users (especially clinicians and patients) also required support in the interpretation of information, both in satisfying themselves that the information is trustworthy and in understanding what it means for a particular individual.
Abstract: We summarize and analyze information gained from interviews with a non-random set of NSDL awardees. One purpose was to inform the development of membership models for the NSDL, and a second purpose was to better define the nature of needed infrastructure, including capabilities for the integration of services across the library. The results of this work shed light on two aspects (one largely social, the other architectural) of the NSDL as a distributed library-building endeavor.
Abstract: The distributed, project-oriented nature of digital libraries (DLs) has made them difficult to evaluate in aggregate. By modifying the methods and tools used to evaluate traditional libraries' content and services, measures can be developed whose results can be used across a variety of DLs. The DigiQUAL protocol being developed by the Association of Research Libraries (ARL) has the potential to provide the National Science Digital Library (NSDL) with a standardized methodology and survey instrument with which not only to evaluate its distributed projects but also to gather data to assess the value and impact of the NSDL.
Abstract: This paper examines user choice of interface language in a bi-language digital library (English and Māori, the language of the indigenous people of New Zealand). The majority of collection documents are in Māori, and the interface is available in both Māori and English. Log analysis shows three categories of preference for interface language: primarily English, primarily Māori, and bilingual (switching back and forth between the two). As digital libraries increase in number, content, and potential user base, interest has grown in 'multilingual' or 'multi-language' collections, that is, digital libraries in which the collection documents and the collection interface include more than one language. Research in multilingual/multi-language digital libraries and web-based document collections has primarily focused on fundamental implementation issues and functionality, principles for design, and small-scale usability tests; at present no analysis exists of how these systems are used, or of how the presence of more than one language in a digital library affects user interactions, presumably because multilingual/multi-language digital libraries are only recently moving from research lab prototypes to fielded systems, and few have built up a significant usage history. This paper describes the application of log analysis to examine interface language preference in a bi-language (English/Māori) digital library, the Niupepa Collection (Section 2). Web log data was collected for a year (Section 3), and log analysis indicates three categories of interface language preference: English, Māori, and 'bilingual' (Section 4). A fine-grained analysis of activities within user sessions indicates different patterns of document access and information-gathering strategy among these three categories (Section 5).
Abstract: In this paper, we present the design, implementation, and evaluation of a self-archiving service for the Brazilian Digital Library of Computing (BDBComp). We focus on the design decisions given the specific context of the Brazilian CS community, the implementation details that follow from those decisions, and the evaluation of the implemented features. For the evaluation, we conducted an extensive usability experiment with several potential users, including graduate students, professors, and archivists/librarians. The results of that study and their implications for similar services are described and analyzed, following sound statistical principles. Finally, a comprehensive comparison of the features of our service and of similar services is also performed.
Abstract: Our system suggests likely identity labels for photographs in a personal photo collection. Instead of using face recognition techniques, the system leverages automatically-available context, like the time and location where the photos were taken.
Based on time and location, the system automatically computes event and location groupings of photos. As the user annotates some of the identities of people in their collection, patterns of re-occurrence and co-occurrence of different people in different locations and events emerge. The system uses these patterns to generate label suggestions for identities that were not yet annotated. These suggestions can greatly accelerate the process of manual annotation.
We obtained ground-truth identity annotation for four different photo albums, and used them to test our system. The system proved effective, making very accurate label suggestions, even when the number of suggestions for each photo was limited to five names, and even when only a small subset of the photos was annotated.
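The co-occurrence idea behind these suggestions can be sketched with a toy event-based suggester: names already annotated in other photos from the same event become ranked candidates for an unlabeled photo. This is a deliberate simplification (the actual system also exploits location groupings and patterns of people co-occurring with each other), and all identifiers here are hypothetical.

```python
from collections import Counter

def suggest_labels(annotations, event_of, photo, top_n=5):
    """Suggest identity labels for `photo` by counting which people
    have already been annotated in other photos from the same event,
    most frequent first."""
    event = event_of[photo]
    counts = Counter()
    for other, names in annotations.items():
        if other != photo and event_of.get(other) == event:
            counts.update(names)
    return [name for name, _ in counts.most_common(top_n)]
```

Capping the list at `top_n=5` mirrors the evaluation setting reported above, where suggestions remained accurate even when limited to five names per photo.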
Abstract: Searching photo libraries can be made more satisfying and successful if search results are presented in a way that allows users to gain an overview of the photo categories. Since photo layouts on computer displays are the primary way that users get an overview, we propose a novel approach to show more photos in meaningful groupings. Photo layouts can be linear strips or zoomable three-dimensional arrangements, but the most common form is the flat two-dimensional grid. This paper introduces a novel bi-level hierarchical layout with motivating examples. In a bi-level hierarchy, one region is designated for primary content, which can be a single photo, text, graphic, or combination. Adjacent to that primary region, groups of photos are placed radially in an ordered fashion, such that the relationship of the single primary region to its many secondary regions is immediately apparent. A compelling aspect is the interactive experience in which the layout is dynamically resizable, allowing users to rapidly, incrementally, and reversibly alter the dimensions and content. It can accommodate hundreds of photos in dozens of regions, can be customized in a corner or center layout, and can scale from an element on a web page to a large poster size. On typical displays (1024 x 1280 or 1200 x 1600 pixels), bi-level radial quantum layouts can conveniently accommodate 2-20 regions with tens or hundreds of photos per region.
Abstract: With the explosive growth of networked collections of musical material, there is a need to establish a mechanism like a digital library to manage music data. This paper presents a content-based processing paradigm of popular song collections to facilitate the realization of a music digital library. The paradigm is built on the automatic extraction of information of interest from music audio signals. Because the vocal part is often the heart of a popular song, we focus on developing techniques to exploit the solo vocal signals underlying an accompanied performance. This supports the necessary functions of a music digital library, namely, music data organization, music information retrieval/recommendation, and copyright protection.
Abstract: This paper describes several challenges that arise in designing a digital library for K-12 education audiences using the Learning Object Metadata (LOM) standard. These problems multiplied when attempting to catalog the wide variety of informal learning and teaching resources from our museum's ever-growing website and exhibit-based resource collections. This paper shares key challenges and early solutions in the creation of an educational metadata scheme based upon LOM, new vocabularies, and strategies for retrofitting existing informal science learning resources into learning objects.
Abstract: This paper discusses the impact of developers' and users' tacit forms of understanding on digital library development. It draws on three years of ethnographic research with the Digital Water Education Library (DWEL) that focused on the observation, collection, and analysis of the project's face-to-face and electronic organizational communication. The DWEL project involved formal and informal educators in the development of its collection, and experienced problems at the start of the project with getting these educators to complete their cataloguing tasks. The research showed that despite having spent several days in face-to-face workshops, the project's PIs and the educators held different tacit understandings of what digital libraries were, and these differences were impeding the project's organizational communication and workflow. I describe how these differences were identified and analyzed, and subsequently addressed and mediated through the design and development of online tools that acted as boundary objects between the PIs and the educators.
Abstract: This paper describes exploratory research on the automatic assignment of educational standards to lesson plans. An information retrieval-based solution was proposed, and the results of several experiments are discussed. Results suggest that the optimal solution is a recommender tool in which catalogers receive suggestions from the system but humans make the final decision.
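As a sketch of the kind of retrieval-based recommender the abstract points toward (not the paper's actual system), standards can be ranked against a lesson plan by TF-IDF cosine similarity, with the top suggestions surfaced for a cataloger to confirm. The function names, whitespace tokenization, and smoothed IDF are assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def suggest_standards(lesson, standards, k=3):
    """Return up to k standards most similar to the lesson plan text;
    a human cataloger makes the final assignment decision."""
    docs = [lesson.lower().split()] + [s.lower().split() for s in standards]
    vecs = tfidf_vectors(docs)
    scored = sorted(
        ((cosine(vecs[0], vecs[i + 1]), s) for i, s in enumerate(standards)),
        reverse=True)
    return [s for score, s in scored[:k] if score > 0]
```

Standards with no term overlap score zero and are dropped, so the cataloger only sees plausible candidates.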
Abstract: In this paper, we discuss the findings of an in-depth ethnographic study of reading and within-document navigation and add to these findings the results of a second analysis of how people read comparable digital materials on the screen, given limited navigational functionality. We chose periodicals as our initial foil since they represent a type of material that invites many different kinds of reading and strategies for navigation. Using multiple sources of evidence from the data, we first characterize readers' navigation strategies and specific practices as they make their way through the magazines. We then focus on two observed phenomena that occur when people read paper magazines, but are absent in their digital equivalents: the lightweight navigation that readers use unselfconsciously when they are reading a particular article and the approximate navigation readers engage in when they flip multiple pages at a time. Because page-turning is so basic and seems deceptively simple, we dissect the turn of a page, and use it to illustrate the importance and invisibility of lightweight navigation. Finally, we explore the significance of our results for navigational interfaces to digital library materials.
Abstract: In this article we describe a system for annotating digital books in a digital library for young adult readers and discuss a first field test of the system with a small group of young adults together with their families and a few friends. We argue that most studies of digital libraries and their patrons' needs or desires look at adult users pursuing work-related goals such as collaborative writing tasks encountered in college or in professional workplaces. Most studies also examine the problems and issues of research libraries rather than public libraries. Yet children who are growing up digital may not only have different and heretofore unrecognized needs but also may have insights into the needs future library patrons may have. And public libraries, which support a broader public's reading habits, may also need different kinds of tools for their patrons. Because we are working with younger readers, we have focused on supporting active reading but in the context of reading for pleasure. Our results show that the digital library book for young adults can become a "practiced place," by which we mean a site of shared, constructed meaning through the traces that individuals' reading and writing create.
Abstract: Movable books provide interesting challenges for digitization and user interface design. We report in this paper some preliminary results in building a 3D visualization workbench for such books.
Abstract: In this article we present an evaluation of the use of text clustering and classification methods to improve digital library browse interfaces over metadata lacking a unified ontological basis. This situation is common in ``portal'' style digital libraries, which are built by harvesting content from many disparate sources, typically using the Open Archives protocol for metadata harvesting (OAI-PMH).
Abstract: The concept map, first suggested by Joseph Novak, has been extensively studied as a way for learners to increase understanding. We are automatically generating and translating concept maps from electronic theses and dissertations, for both English and Spanish, as a DL aid to discovery and summarization.
Abstract: We briefly discuss the architecture and design of a collection understanding tool that utilizes information visualization and the Open Archives Initiative Protocol for Metadata Harvesting to help users understand the essence of image collections in OAI-PMH compliant repositories.
Abstract: Events may be best understood in the context of other events. Because of their temporal ordering, we can call a set of related events a timeline. Such timelines, in turn, are best understood in the context of other timelines. To facilitate the exploration of a collection of timelines and events, a visualization tool has been developed that structures the user's browsing. In this model, each event is accompanied by a text description and links to related resources. In particular, this system can provide a browsing interface for digitized historical newspapers.
Abstract: This study analyzes metadata shared by cultural heritage institutions via the Open Archives Initiative Protocol for Metadata Harvesting. The syntax and semantics of metadata appearing in the Dublin Core fields creator, contributor, and date are examined. Preliminary conclusions are drawn regarding the effectiveness of Dublin Core in the Open Archives Initiative environment for cultural heritage materials.
Abstract: One of the criticisms library users often make of catalogs is that they rarely include information below the bibliographic level. It is generally impossible to search a catalog for the titles and subjects of particular chapters or volumes. There has been no way to add this information to catalog records without exponentially increasing the workload of catalogers. This paper describes how initial investments in full text digitization and structural markup combined with current named entity extraction technology can efficiently generate the detailed level of catalog data that users want, at no significant additional cost. This system is demonstrated on an existing digital collection within the Perseus Digital Library.
Abstract: It has been nearly sixty years since Vannevar Bush's essay, “As We May Think,” was first published in The Atlantic Monthly, an article that foreshadowed and possibly invented hypertext. While much has been written about this seminal piece, little has been said about the argument Bush presented to justify the creation of the memex, his proposed personal information device. This paper revisits the article in light of current technological and social trends. It notes that Bush's argument centered on the problem of information overload and observes that in the intervening years, despite massive technological innovation, the problem has only become more extreme. It goes on to argue that today's manifestation of information overload will require not just better management of information but the creation of space and time for thinking and reflection, an objective that is consonant with Bush's original aims.
Abstract: Unlike many efforts that focus on supporting scholarly research by developing large-scale, general resources for a wide range of audiences, we at the Cervantes Project have chosen to focus more narrowly on developing resources in support of ongoing research about the life and works of a single author, Miguel de Cervantes Saavedra (1547-1616). This has led to a group of hypertextual archives, tightly integrated around the narrative and thematic structure of Don Quixote. This project is typical of many humanities research efforts, and we discuss how our experiences inform the broader challenge of developing resources to support humanities research.
Abstract: Visualizations can be used to help users make sense of a collection of documents. This paper presents two techniques for displaying document information by automatic positioning of document icons. Positional coding of information can be particularly effective as the human eye rapidly distinguishes different positions, but it can also be space-intensive.
Thus, our techniques are designed to use space efficiently. The first technique, icon abacus, displays the value of an attribute with a small set of fixed values, using displacements along a single axis, so that other metadata can be displayed simultaneously on the other axis. An icon abacus uses about half as much screen space as a three-column grid on typical data sets.
The second technique, ghost icons, shows the relationship of a document to cited or citing documents by allowing the related documents to become full members of the visualization temporarily. Both techniques allow the reader to absorb metadata rapidly, while forming and using spatial memories of the document visualization.
Abstract: This paper describes the development of practical automatic metadata assignment tools to support automatic record creation for virtual libraries, metadata repositories and digital libraries, with particular reference to library-standard metadata. The development process is incremental in nature, and depends upon an automatic metadata evaluation tool to objectively measure its progress. The evaluation tool is based on and informed by the metadata created and maintained by librarian experts at the INFOMINE Project, and uses different metrics to evaluate different metadata fields. In this paper, we describe the form and function of common metadata fields, and identify appropriate performance measures for these fields. The automatic metadata assignment tools in the iVia virtual library software are described, and their performance is measured. Finally, we discuss the limitations of automatic metadata evaluation, and cases where we choose to ignore its evidence in favor of human judgment.
Abstract: Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built libraries of reasonable size, the resulting collections can contain only a portion of the documents from specific publishing venues. Here we discuss alternative online resources and techniques that maximally exploit them to build the complete document collection of any given publication venue.
We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over a time frame of 1998 to 2004. We then identify the missing papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that can locate authors' homepages and then apply focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation of the crawler's performance based on the harvest rate shows definite advantages over other crawling approaches, and the crawler consistently outperforms a defined baseline crawler on a number of measures.
Abstract: Constructing a Chinese digital library, especially one archiving historical articles, is often hampered by the small character sets supported by current computer systems. This paper aims to resolve the unencoded-character problem with a practical, composite approach for Chinese digital libraries. The proposed approach consists of a glyph expression model, a glyph structure database, and supporting tools. With this approach, the following problems can be resolved. First, the extensibility of Chinese characters is preserved. Second, unencoded characters become as easy to generate, input, display, and search as existing ones. Third, the approach is compatible with the coding schemes most computers already use. It has been utilized by organizations and projects in various application domains, including archeology, linguistics, ancient texts, calligraphy and paintings, and stone and bronze rubbings. For example, at Academia Sinica, a very large full-text database of ancient texts called Scripta Sinica has been created using this approach. The Union Catalog of the National Digital Archives Project (NDAP) used it to handle the unencoded characters encountered when merging the metadata of 12 different thematic domains from various organizations. Also, the Bronze Inscriptions Research Team (BIRT) of Academia Sinica added 3,459 bronze inscriptions, which is very helpful to education and research in historical linguistics.
Abstract: In this paper we present our rationale and design principles for a distributed e-library of medieval chant manuscript transcriptions. We describe the great variety in early neumatic notations, in order to motivate a standardized data representation that is lossless and universal with respect to these musical artifacts. We present some details of the data representation and an XML Schema for describing and delivering transcriptions via the Web. We argue against proposed data formats that look simpler, on the grounds that they will inevitably lead to fragmentation of digital libraries. We plan to develop applications software that will allow users to take full advantage of the carefully designed representation we describe, while shielding users from its complexity. We argue that a distributed e-library of this kind will greatly facilitate scholarship, education, and public appreciation of these artifacts.
Abstract: This paper is a case study of metadata development in the early stages of the National Digital Newspaper Program, a twenty-year digital initiative to expand access to historical newspapers in support of research and education. Some of the issues involved in newspaper metadata are examined, and a new XML-based standard is described that is suited to the large volume of data, while remaining flexible into the future.
Abstract: To facilitate long-term preservation and to sustain the utility of phonograph record albums, an efficient and economical workflow management system for digitization of these important cultural heritage artifacts is needed. In this paper, we describe the digitization process of creating an online digital collection and our procedure for creating the ground-truth data, which is essential for developing an efficient metadata and content capturing system. We also discuss the challenges of defining metadata for phonograph records and its packaging to facilitate new forms of online access and preservation.
Abstract: An author may have multiple names, and multiple authors may share the same name, simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies (citations). This can produce name ambiguity, which can affect the performance of document retrieval, web search, and database integration, and may cause improper attribution of credit. Proposed here is an unsupervised learning approach using K-way spectral clustering that disambiguates authors in citations. The approach utilizes three types of citation attributes: co-author names, paper titles, and publication venue titles. The approach is illustrated with 16 name datasets with citations collected from the DBLP database bibliography and author home pages and shows that name disambiguation can be achieved using these citation attributes.
Abstract: In this paper, we consider the problem of ambiguous author names in bibliographic citations and comparatively study alternative approaches to identifying name variants (e.g., ``Vannevar Bush'' and ``V. Bush''). We adopt a two-step framework in which step 1 substantially reduces the number of candidates via blocking, and step 2 measures the distance between two names via coauthor information. Combining four blocking methods and seven distance measures on four data sets, we present extensive experimental results and identify a few combinations that are scalable and effective.
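A minimal sketch of such a two-step pipeline might block on last name plus first initial and then compute a Jaccard distance over coauthor sets. The key format, threshold, and helper names below are illustrative assumptions, not the paper's actual configuration:

```python
def block_key(name):
    """Step 1 (blocking): last name plus first initial,
    e.g. 'Vannevar Bush' -> 'bush_v'."""
    parts = name.lower().replace(".", "").split()
    return f"{parts[-1]}_{parts[0][0]}"

def coauthor_distance(coauthors_a, coauthors_b):
    """Step 2 (distance): Jaccard distance between coauthor sets
    (0.0 = identical, 1.0 = disjoint)."""
    a, b = set(coauthors_a), set(coauthors_b)
    if not a and not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)

def match_variants(records, threshold=0.7):
    """records: list of (name, coauthor list) pairs. Returns pairs judged
    to be the same author: same block AND distance below the threshold."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            (n1, c1), (n2, c2) = records[i], records[j]
            if block_key(n1) != block_key(n2):
                continue  # blocking prunes this candidate pair
            if coauthor_distance(c1, c2) < threshold:
                matches.append((n1, n2))
    return matches
```

Blocking keeps the pairwise distance computation tractable: only names falling into the same block are ever compared, which is the point of the two-step design.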
Abstract: In this paper, we attempt to give spatial semantics to web pages by assigning them place names. The entire assignment task is divided into three sub-problems, namely place name extraction, place name disambiguation, and place name assignment, and we propose approaches to address each of them. In particular, we have modified GATE, a well-known named entity extraction software package, to perform place name extraction using a US Census gazetteer. A rule-based place name disambiguation method and a place name assignment method capable of assigning place names to web page segments have also been proposed. We have evaluated our proposed disambiguation and assignment methods on a web page collection referenced by the DLESE metadata collection. The results returned by our methods are compared with manually disambiguated place names and manual place name assignments. It is shown that our proposed place name disambiguation method works well for geo/geo ambiguities. Preliminary evaluation of our place name assignment method is promising, given the existence of geo/non-geo ambiguities among place names.
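To illustrate the flavor of rule-based geo/geo disambiguation, the sketch below prefers the gazetteer sense whose containing state is mentioned nearby, falling back to the most populous sense. The gazetteer entries, rule ordering, and population figures are invented for the example; they are not drawn from the US Census gazetteer or the paper:

```python
# Hypothetical mini-gazetteer: place name -> list of (state, population) senses.
GAZETTEER = {
    "springfield": [("illinois", 114000),
                    ("massachusetts", 155000),
                    ("missouri", 169000)],
}

def disambiguate(place, context_tokens, gazetteer=GAZETTEER):
    """Resolve a geo/geo ambiguous place name to a (place, state) pair.

    Rule 1: prefer the sense whose state appears in the surrounding text.
    Rule 2 (fallback): prefer the most populous sense.
    Returns None for names absent from the gazetteer (a geo/non-geo case
    would need further handling).
    """
    senses = gazetteer.get(place.lower())
    if not senses:
        return None
    context = {t.lower() for t in context_tokens}
    for state, _pop in senses:
        if state in context:  # rule 1: contextual state mention
            return (place.lower(), state)
    state, _pop = max(senses, key=lambda s: s[1])  # rule 2: population
    return (place.lower(), state)
```

So "Springfield" in a page that also mentions "Illinois" resolves to the Illinois sense, while "Springfield" with no contextual cue falls back to the most populous entry.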