The Biological Information Browsing Environment (BIBE) is a facility to help novices and experts find information about plants and animals in digital collections. The Project is funded by the National Science Foundation Biological Databases and Informatics Program and is a collaboration between the Graduate School of Library and Information Science at the University of Illinois, Illinois Natural History Survey, Missouri Botanical Garden, Flora of North America Project at the Harvard Herbarium, and Illinois Department of Natural Resources.
The objective of the Project is to facilitate access to online flora and fauna by both novices and experts through enhanced indexing, searching, and visualization techniques. Specific search facilities and content will be added to help users with different levels of domain knowledge identify species by augmenting professionally developed taxonomic treatments or species descriptions. This is a novel use of taxonomic descriptions. In the course of development the system will undergo a series of qualitative and quantitative evaluations and redesigns by several communities of users, including professional entomologists and botanists as well as citizen scientists performing biodiversity surveys.
Contents
Information system use and design: EcoWatch and PLAN-IT Programs
Participatory design and evaluation
Immersive design techniques in the BIBE Project
Features of the system
Information System Use and Design: EcoWatch and PLAN-IT Programs
A system's success can only be measured by its usefulness or "fit" within a particular context of use. We set out to design and test BIBE with groups of high school students who conduct biodiversity surveys under guidelines developed for the Illinois EcoWatch program and the PLAN-IT teacher enhancement program. A biodiversity survey is a count of the number of species in an area. It is a critical activity for land management and environmental change research. This relatively formal setting of a pre-established information environment helps to define a clear set of pressing information needs. Students and adult volunteers need to perform many tasks as part of EcoWatch, but one of the most challenging is species identification. Participants need to be able to identify hundreds of plant and insect species. In this Project, we study how participants learn to identify species and when they fail. This understanding will allow us to design a system that allows for more efficient and accurate species identification.
Three hundred and forty Illinois high school teachers from over 300 different schools have been trained in at least one ecosystem monitoring program (RiverWatch, ForestWatch, PrairieWatch, WetlandWatch) through the PLAN-IT EARTH partnership. Many have been trained in multiple ecosystems. As numbers are cumulative, we could conservatively estimate that the EcoWatch program has reached between 15,000 and 20,000 students in classes taught by these teachers.
Participatory Design and Evaluation
Many systems designers and researchers have come to recognize the need for a more complete understanding of the use of information in a social context when designing systems. Nardi and O'Day (1999) use the phrase "information ecology" to capture the notion that social practices, user goals, technology, and information are co-dependent. Any change in one will have impacts throughout an organization or society. In the BIBE Project we have adopted a number of techniques ranging from qualitative participatory design to system performance evaluation to understand the information needs of citizen scientists performing biodiversity surveys.
BIBE is but one new tool in a pre-established information-use community. If it is to work well in the context of its use, we need to develop a good understanding of that context. To this end, we are gathering information from all of the user groups in the program, including the EcoWatch program initiators, high school teachers and students, and adult EcoWatch volunteers. Each of these constituents of the user community participates in a number of participatory design activities.
The species identification focus of the project arises out of needs identified by the program originators. Dr. Michael Jeffords from the Illinois Natural History Survey was a key developer of EcoWatch procedures and training, particularly ForestWatch. Through in-classroom training and feedback from participants in person and by e-mail, he was able to verify that EcoWatch participants have a high level of concern about the accuracy of their identifications. For example, below is an excerpt from an e-mail sent by a teacher to the EcoWatch Coordinator in Springfield, Illinois.
"The teacher said that his kids knew all 18 taxa perfectly well in the classroom (show 'em a slide or picture, and they knew which one it was etc.), but then in the field it is totally different. They do not get a good look at the butterflies to confidently ID them. Another teacher said it may seem fine for experts, but the experts need to consider that we volunteers only have so much time to learn the various procedures. 'We can't be an expert at everything.'"
The same program director remains convinced that the volunteers are capable of identifying the species. From a later e-mail:
"I have had many volunteers tell me that they have very little confidence in their ability to ID the butterfly indicators. This appears to be especially difficult for schools, but adult volunteers have also told me this. I DO NOT BELIEVE THIS. MOST INDIVIDUALS CAN IDENTIFY 18 OF ANYTHING, IF THEY ARE INTERESTED ENOUGH TO LEARN THEM. MY 8-YEAR OLD DAUGHTER KNOWS AT LEAST 30 DIFFERENT AMERICAN GIRL DOLLS AND ALL THEIR ACCESSORIES! ANY TEENAGER CAN IDENTIFY INUMERABLE (sic) SONGS THAT ALL SOUND THE SAME TO ME! I REALLY THINK THIS IS A TRAINING PROBLEM." [Bolding in original.]
This impression by volunteers is warranted by verification procedures that are built into the EcoWatch process. Professional biologists verify the survey data for a sample of those conducted. A good example of this process is a verification of the RiverWatch data conducted by Edward DeWalt of the Illinois Natural History Survey (DeWalt, 2000). DeWalt reports a high correlation between RiverWatch volunteer diversity estimates and professional estimates. Errors in identification were relatively low, although some additional training is needed in some groups. Still, volunteers are frequently anxious about these error rates and everyone associated with the project would like to improve them. A key design objective of BIBE is increased identification accuracy and user feedback. An informal survey of teachers indicates that many, while following EcoWatch procedures in the classroom, do not submit data to the EcoWatch program.
Immersive design techniques in the BIBE Project
The first method employed was to conduct informal interviews with over a dozen experts and citizen scientists to help set the scope for further study. These interviews included directed discussions with high school teachers, EcoWatch trainers and administrators, and Missouri Botanical Garden professional staff.
One method to gain a better understanding of user needs is to become a user (Beyer & Holtzblatt, 1995). To this end, Bryan Heidorn, the Principal Investigator on the Project, received ForestWatch training and became a certified volunteer in the spring of 2000. Likewise, Mary Lokhaiser, a doctoral student on the Project, received ForestWatch and PrairieWatch training in the summer of 2000. These experiences serve as a first-hand source of understanding of the information needs of participants, including not only plant and insect identification information but good sources of treatments for poison ivy, insect bites, and sunburn.
Another method of study being used in BIBE design is non-intrusive classroom observation. In the fall of 2000, Project staff attended classroom instruction of high school students for ForestWatch. High school teachers were also monitored in their integration of EcoWatch training into high school curricula. BIBE staff were non-participating observers, keeping track of information resources used in the classroom and subsequent field work. Some of this material is provided as part of the EcoWatch and PLAN-IT programs and other information is derived from sources such as field guides.
While it is valuable to become a user to understand user needs, it is sometimes difficult to observe what is happening when you are part of the process. For this reason, the design team will act as non-participant observers in a number of EcoWatch field surveys. The researchers make field notes and videotape the surveys. The first of these sessions was held in the fall of 2000; additional surveys will be recorded in the spring and fall of 2001. These observations will help us to understand the information behavior and information problems encountered by participants in the field. Some of the analyses of these observations are guided by Pankhurst's (1993) observations about the user perspective on species identification. Pankhurst observes that the simplest and fastest way to identify a species is just to know what it is. What percentage of the time do high school EcoWatch participants "just know" the species of a tree in a survey? How often are they correct when they first encounter a species during a survey? How often do individuals correctly identify species on their own by the end of a survey? Are there differences between adult volunteers and students in these parameters? Pankhurst goes on to identify the next best method of species identification after just knowing - stand next to someone who knows! In our study we will use the video to analyze communication between participants. Who communicates to whom? What information do they volunteer? What questions do they ask? What answers do they get and how secure are the answers? When experts are available, will students gravitate towards the experts and ask for assistance? So as not to interfere unnecessarily with our observations, observers will kindly decline to answer questions about identification or EcoWatch procedures.
Features of the system
The design techniques described above will be used to identify new features for the system. This includes input from professional botanists and entomologists, citizen scientists, teachers, and students. It will include direct participation by the developers, field observation, focus groups, and monitored observation and experimentation with the new system. Based on input from teachers, EcoWatch professionals, and a user vocabulary study (Heidorn & Cui, 2000) there are a number of features that are planned or under development. The techniques are derived from information retrieval research and adapted to the identification and training tasks. The techniques to be used include information space browsing, XML recoding, phrase extraction, information extraction, and similarity-based retrieval (relevance feedback). Figure 2 is a snapshot of the current version of the interface, in this case a search of the Flora of North America (Flora of North America Editorial Committee, 2000).
Figure 2: BIBE Interface
In the following discussion a specimen identification task is redefined as an information retrieval task. The user identifies a specimen by retrieving species descriptions from a standard flora or fauna. In the BIBE model an identification retrieval task is a two-step process. An initial query is executed to filter the data to produce a resultant set of species descriptions. This initial filtering step may produce hundreds of species that are then visualized or explored using the interface as depicted above. In this case each document is a description of a tree or butterfly species. The "trees" at the corners of the triangle are points of interest (POI). The searcher provides these points to organize the set. A user may place the POIs anywhere on the screen. In this case they represent three independent queries. These points specify the facets of a data set that will serve as the basis for similarity functions between the "visualization query" and individual species descriptions. The open rectangles on the display are object description icons representing individual species descriptions.
The location of species icons on the screen is determined by the relative similarity between the species description and all of the POIs. Species descriptions that are more like a point of interest are attracted to that point on the computer screen. A user may click on a rectangular document icon to see the entire database record for the species that the icon represents. Figure 3 is an image of a species description from Butterflies of North America (Opler, Pavulaan, & Stanford, 1995). This interface is very much like the visualization technique for text documents used in VIBE (Olsen et al., 1993; Morse & Lewis, 1997). A discussion of this display type vs. traditional result lists is presented elsewhere (Heidorn & Cui, 2000).
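The attraction of icons toward points of interest can be illustrated with a small sketch. It assumes that a non-negative similarity score between a species description and each POI has already been computed, and it places the icon at the similarity-weighted average of the POI positions. This weighted centroid is only one simple way to realize such a layout; the function name `icon_position` and the screen coordinates are hypothetical and are not taken from the BIBE software.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def icon_position(poi_positions: Sequence[Point],
                  similarities: Sequence[float]) -> Optional[Point]:
    """Place an icon at the similarity-weighted average of the POI positions.

    Returns None when the description is not similar to any POI, since
    such an icon has no meaningful place in the layout.
    """
    if len(poi_positions) != len(similarities):
        raise ValueError("one similarity score is needed per POI")
    total = sum(similarities)
    if total <= 0:
        return None
    x = sum(s * px for s, (px, _) in zip(similarities, poi_positions)) / total
    y = sum(s * py for s, (_, py) in zip(similarities, poi_positions)) / total
    return (x, y)


# Three POIs at the corners of a triangle (hypothetical screen coordinates).
POIS = [(0.0, 0.0), (100.0, 0.0), (50.0, 100.0)]

# A description similar only to the first POI sits on top of that POI.
print(icon_position(POIS, [1.0, 0.0, 0.0]))
# A description equally similar to all three POIs sits at the triangle's centroid.
print(icon_position(POIS, [1.0, 1.0, 1.0]))
```

Under this scheme, moving a POI on the screen moves every icon that has some similarity to it, which is what makes the display useful for exploring a result set.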
Figure 3: Document Display
In this Project we are defining a new measure of similarity based on the structure of the underlying data. This information is embedded in the text and extracted using information extraction as discussed below.
In the next version of BIBE, individuals will be able to search for words inside particular fields. We expect task performance to improve if users can use these facets in query formulation and browsing. Documents such as the Flora of North America and the Butterflies of North America are composed of discrete parts such as range and wing span. Scientific journal articles are composed of a bibliographic section (author, title, and publication), abstract, introduction, methods, results, and other elements. The adoption of XML standards will increase the prevalence of databases with this type of faceted structure.
A POI may specify the value of any set of facets (or fields) of a data set. In the Flora of North America (FNA), the Leaf facet was defined to equal the value of the leaf field of another species. The value could have easily been set to a simpler function such as "Leaf = convex" and "Leaf = planar" or "petiole length = 7mm." One POI could also specify the value of multiple facets such as the "Leaf" and "Inflorescence" or flower arrangement facet.
To add this capability we need to develop three components: full text indexing of XML documents, automated XML markup of semi-structured documents, and graphical user interface support for the new functionality. We have already modified a version of Swish-e (http://sunsite.berkeley.edu/SWISH-E/), software distributed under the GNU General Public License, to support XML fielded queries. This feature is currently being added to the BIBE interface.
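The idea of a fielded query can be sketched in a few lines. The example below does not use Swish-e; it uses Python's standard `xml.etree.ElementTree` module, and the element names (`collection`, `species`, `leaf`, `habitat`) and the sample records are assumptions made purely for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-record collection; the markup is illustrative only.
SAMPLE_XML = """
<collection>
  <species name="Species A">
    <leaf>blade ovate to lanceolate</leaf>
    <habitat>moist woods</habitat>
  </species>
  <species name="Species B">
    <leaf>blade linear</leaf>
    <habitat>thickets of lanceolate-leaved shrubs</habitat>
  </species>
</collection>
"""


def fielded_search(xml_text, field, word):
    """Return the names of species whose given field contains the word."""
    root = ET.fromstring(xml_text.strip())
    word = word.lower()
    hits = []
    for species in root.iter("species"):
        for element in species.iter(field):
            tokens = re.findall(r"[a-z]+", (element.text or "").lower())
            if word in tokens:
                hits.append(species.get("name"))
                break  # one match per species is enough
    return hits


# The same word retrieves different species depending on the field searched.
print(fielded_search(SAMPLE_XML, "leaf", "lanceolate"))
print(fielded_search(SAMPLE_XML, "habitat", "lanceolate"))
```

A full-text search for "lanceolate" would return both records; restricting the query to the leaf field returns only the species whose leaves are described that way.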
Several approaches are being taken to developing an XML-based taxonomic collection. First, current online collections do not contain all of the species that are monitored in our test environment, EcoWatch. For that reason we are developing our own taxonomic descriptions for those species. These new files are being written using a standard, Structure of Descriptive Data (SDD), that is currently under development by the Taxonomic Databases Working Group (TDWG) of the International Union of Biological Sciences. An example document may be found in Appendix A.
A second approach is to mark up "legacy" data such as the Flora of North America using software filters prior to indexing. These filters will mark up the text in relatively coarse units such as <identification>, <morphology>, and <references>. In order to make the information in these files truly useful, it is necessary to use more advanced techniques to extract the information.
Knowledge-based data extraction: The vector-based statistical (word frequency) approach discussed above may show favorable results but it does not take advantage of all of the information that is available in the descriptions because words are treated independently. For example, below is the leaf description of Calycanthus occidentalis from FNA. A POI with the Leaf facet set to "Lanceolate" would attract this description because "lanceolate" occurs twice.
Calycanthus occidentalis - Shrubs, to 4 m. Lateral bud exposed. Petiole 5-10 mm, pubescent to glabrous. Leaf blade ovate-lanceolate to oblong-lanceolate or ovate-elliptic, 5-15 × 2-8 cm, base rounded to nearly cordate, apex acute to obtuse; surfaces abaxially green, pubescent to glabrous. Flowers: hypanthium campanulate or ovoid-campanulate at maturity, 2-4 × 1-2 cm; petals linear to linear-spatulate or ovate-elliptic, 2-6 × 0.5-1 cm, apex rounded; stamens 10-15, linear to oblong-linear. 2n = 22 (FNA, Vol 2)
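The word-frequency view of this description can be reproduced with a short sketch. The tokenizer below, which splits on anything that is not a letter, is an assumption; an indexing engine may tokenize differently. The text is an excerpt of the leaf blade portion of the description.

```python
import re
from collections import Counter

# Excerpt of the leaf blade portion of the Calycanthus occidentalis description.
LEAF_TEXT = ("Leaf blade ovate-lanceolate to oblong-lanceolate or "
             "ovate-elliptic, base rounded to nearly cordate, "
             "apex acute to obtuse")


def term_frequencies(text):
    """Count each word independently, splitting hyphenated terms apart."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


FREQS = term_frequencies(LEAF_TEXT)
print(FREQS["lanceolate"])   # 2
print(FREQS.most_common(5))
```

The counts show why the description is attracted to a "Lanceolate" POI, and also what is lost: "ovate-lanceolate to oblong-lanceolate" describes a range of shapes, but the bag of words records only that "lanceolate" appears twice.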
In the knowledge-based approach three new processes are added to the system. These include the definition of data structures to organize content, parsers to fill the structures for different databases, and the definition of similarity functions. In this example a data structure might be created to store shape ranges, width and length minimum and maximum, base and apex angle, and other features typically found in the Leaf facet. Ideally, the data is converted to a non-text format. For example, "base rounded" would be converted to an angle in degrees as would "nearly cordate". Linguistic labels are inappropriate because of the nature of the similarity functions as discussed below. A substantial challenge is to create parsers to fill the data structures. Since this is a relatively confined domain the parsers may be semantically driven as in prior work (Heidorn, 1997). It is impractical to attempt to build a parser that can handle the syntax of all flora and fauna. The research question here is not, "can a generalized parser be constructed?" but "can minimal knowledge-based parsing improve data representation enough to result in meaningful improvements in the identification task and usability of online flora and fauna?" The parsers will be designed to extract data from the primary databases used in this Project (FNA, Illinois Trees, Butterflies of Illinois, and CalFlora).
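A minimal sketch of the first two of these processes, a data structure and a parser that fills it, is given below. It covers only the numeric size ranges; converting shape terms such as "base rounded" to angles would require a mapping that is not given here and is therefore left out. The class name `LeafRecord`, its fields, and the regular expressions are all assumptions for illustration, tuned to the phrasing of the Calycanthus occidentalis description rather than to FNA in general.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

Range = Tuple[float, float]


@dataclass
class LeafRecord:
    """Hypothetical structure holding numeric ranges from a Leaf facet."""
    petiole_mm: Optional[Range] = None
    blade_length_cm: Optional[Range] = None
    blade_width_cm: Optional[Range] = None


_NUM = r"(\d+(?:\.\d+)?)"
_PETIOLE = re.compile(rf"Petiole\s+{_NUM}\s*-\s*{_NUM}\s*mm")
# The length-by-width separator may appear as a multiplication sign, "x",
# or "_" depending on how the source text was encoded.
_BLADE = re.compile(
    rf"Leaf blade[^;.]*?{_NUM}\s*-\s*{_NUM}\s*[×x_]\s*{_NUM}\s*-\s*{_NUM}\s*cm"
)


def parse_leaf(text: str) -> LeafRecord:
    """Fill a LeafRecord from free text; unmatched fields stay None."""
    record = LeafRecord()
    match = _PETIOLE.search(text)
    if match:
        record.petiole_mm = (float(match.group(1)), float(match.group(2)))
    match = _BLADE.search(text)
    if match:
        record.blade_length_cm = (float(match.group(1)), float(match.group(2)))
        record.blade_width_cm = (float(match.group(3)), float(match.group(4)))
    return record


LEAF_SAMPLE = ("Petiole 5-10 mm, pubescent to glabrous. Leaf blade "
               "ovate-lanceolate to oblong-lanceolate or ovate-elliptic, "
               "5-15 × 2-8 cm, base rounded to nearly cordate")

print(parse_leaf(LEAF_SAMPLE))
```

Because the patterns are tied to a few fixed phrases, the parser is narrow by design, in keeping with the goal of minimal, domain-specific parsing rather than a generalized parser.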
Traditionally, individuals identify species by using keys that allow them to describe a specimen one characteristic at a time. Printed keys force a user to specify the characteristics in a particular order. Interactive keys allow the characteristics to be specified in any order but still one at a time. In prior work we demonstrated that people prefer to describe flowering plants by noting what other plants they are like (Heidorn, 2000). For example, a gardener may say that the flower of a black-eyed susan looks like a daisy, or an orchid looks like a butterfly. With a little reflection we can see that this is a much more efficient mechanism for specifying what a plant looks like. When we say one plant looks like another we are specifying many characteristics at one time. We plan on supporting this type of similarity-based description. When a volunteer finds a species that is similar to the one they are looking for, they will be able to convert this species into a point of interest for retrieval, activating parts of the description to be used in similarity calculations. For example, the student may use the XML markup fields to set up a POI that attracts other species descriptions based on the flower description and location. Similarity will be defined differently depending on the fields selected. The similarity of any field with a numeric size range can be defined as the proportion of overlap between the ranges. The similarity between unstructured text descriptions can be defined using the cosine measure from information retrieval.
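The two similarity definitions can be sketched as follows. "Proportion of overlap" admits more than one reading; the version below divides the length of the overlap by the length of the combined span of the two ranges, which is an assumption rather than the Project's stated formula. The cosine measure is computed over raw word counts with a simple letters-only tokenizer, also an assumption; no term weighting is applied.

```python
import math
import re
from collections import Counter


def range_overlap(a, b):
    """Overlap length divided by the combined span of two (low, high) ranges."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    span = max(a_hi, b_hi) - min(a_lo, b_lo)
    if span == 0:
        return 1.0  # both ranges are the same single value
    overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
    return max(0.0, overlap) / span


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va = Counter(re.findall(r"[a-z]+", text_a.lower()))
    vb = Counter(re.findall(r"[a-z]+", text_b.lower()))
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Petiole ranges of 5-10 mm and 7-12 mm share 3 mm of a 7 mm combined span.
print(range_overlap((5, 10), (7, 12)))
# Two habitat descriptions that share some vocabulary.
print(cosine_similarity("open sunny areas with low vegetation",
                        "sunny gardens and open lawns"))
```

Both functions return values between 0 and 1, so scores from numeric fields and text fields could be combined when a POI activates several fields at once.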
About the Author
P. Bryan Heidorn is Assistant Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign.
This paper is based upon work supported by the National Science Foundation under Grant No. DBI-9982849, REU#1-5-50916 and the Research Review Board of the University of Illinois at Urbana-Champaign.
Major Participants
Participants in this Project include the following, all at the University of Illinois at Urbana-Champaign:
Hong Cui, Doctoral student, Graduate School of Library and Information Science.
Zeeshan Farees, Senior in Computer Science.
Dr. Bryan Heidorn, Assistant Professor, Graduate School of Library and Information Science.
Wesley J. Janik, Senior in Computer Engineering.
Dr. Michael Jeffords, Professional Scientist, Public Relations and Education Liaison, Illinois Natural History Survey, Associate Professor, Natural Resources and Environmental Sciences, College of Agricultural, Consumer, and Environmental Sciences.
Mary F. Lokhaiser, Doctoral student, Department of Natural Resources and Environmental Sciences.
Marija Markovic, Doctoral Student, Department of Linguistics.
Bharat Mehra, Doctoral Student, Graduate School of Library and Information Science.
References
H. R. Beyer and K. Holtzblatt, 1995. "Apprenticing with the customer," Communications of the ACM, volume 38, number 5, pp. 45-52. http://dx.doi.org/10.1145/203356.203365
R. E. DeWalt, 2000. "Critical trend," EcoWatcher, volume 2, number 4, at http://dnr.state.il.us/orep/inrin/ecowatch/news/EcoWatch_vol2n4/index.html
Flora of North America Editorial Committee, 2000. "Flora of North America," at http://hua.huh.harvard.edu/FNA/
P. B. Heidorn, 1997. "Natural Language Processing of Visual Language for Image Storage and Retrieval," Ph.D. Dissertation, University of Pittsburgh (available from UMI, Ann Arbor, Mich.).
P. B. Heidorn and H. Cui, 2000. "The Interaction of Result Set Display Dimensionality and Cognitive Factors in Information Retrieval Systems," Proceedings of the Annual Meeting of the American Society for Information Science (ASIS 2000), Chicago, 13-16 November, pp. 258-270.
E. Morse and M. Lewis, 1997. "Why information visualizations sometimes fail," Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Orlando, Fla., 12-15 October.
B. A. Nardi and V. O'Day, 1999. Information Ecologies: Using Technology With Heart. Cambridge, Mass.: MIT Press.
K. A. Olsen, R. R. Korfhage, K. M. Sochats, M. B. Spring, and J. G. Williams, 1993. "Visualization of a document collection: The VIBE system," Information Processing and Management, volume 29, number 1, pp. 69-81. http://dx.doi.org/10.1016/0306-4573(93)90024-8
P. A. Opler, H. Pavulaan, and R. E. Stanford (coordinators), 1995. Butterflies of North America. Jamestown, N.D.: Northern Prairie Wildlife Research Center, Web site, http://www.npwrc.usgs.gov/resource/distr/lepid/bflyusa/bflyusa.htm (Version 17AUG2000).
R.J. Pankhurst, 1993. "Principles and problems of identification," In: R. Fortuner (editor). Advances in computer methods for systematic biology. Baltimore: Johns Hopkins University Press, pp. 125-136.
Appendix A
<!DOCTYPE Butterfly SYSTEM "bfly.dtd">
<Gender>Female and Male</Gender>
<Source>Butterflies of Illinois</Source>
<SourceDescription>Bouseman, J. and Sternburg, J. (2001). Field guide to butterflies of Illinois. Champaign, IL: Illinois Natural History Survey.</SourceDescription>
<topside>front wing bright iridescent red with black spots and border, hind wing brown with orange border</topside>
<underside>frontwing red with black spots and dark brown border, hindwing pale buff white with black spots and a wavy red submarginal line</underside>
<flighthabits>males perch awaiting passing females, dart out at passing insects,</flighthabits>
<habitat>open sunny areas with low vegetation, gardens, lawns, mowed areas, bare ground</habitat>
<ContributorName>Mary F. Lokhaiser</ContributorName>
<ContributorDetails>607 East Peabody Drive, Champaign, IL 61820, 217-244-9407, email@example.com</ContributorDetails>
Paper received 20 November 2000; accepted 31 December 2000; revised 30 January 2001.
Copyright ©2001, First Monday
A Tool for Multipurpose Use of Online Flora and Fauna: The Biological Information Browsing Environment (BIBE) by P. Bryan Heidorn
First Monday, volume 6, number 2 (February 2001),