Digital Library: IMLS Grant

In 1999, the Ewell Sale Stewart Library received a National Leadership Grant from the Institute of Museum and Library Services (IMLS) to digitize collections that document the establishment of a distinctly American approach to natural science during the first half of the nineteenth century and make them available on the web. See American Natural Science in the First Half of the Nineteenth Century to access the images resulting from this project. A description of the project is presented below.

We would very much like feedback on this project: comments on its utility, ease of use, and anything else are appreciated. Please send comments to the Project Manager.

Project Description

IMLS is an independent federal agency that lends support to the nation's museums and libraries. The $151,863 grant was awarded to the Academy Library to "preserve and enhance access to unique library resources to the broader community; address the challenges of preserving and archiving digital media", and to develop "standards, techniques, or models related to the digitization and management of digital resources."

The Library has digitized printed texts and illustrations, manuscripts, and drawings from its extensive print and archive collections. The original watercolor shown here, drawn by Titian Peale, is an example of items digitized from the Academy's archive collection. Titian Peale, son of Charles Willson Peale, was an avid naturalist and an accomplished artist in his own right. He was responsible for many of the illustrations in early Academy works.

Original illustrations by Lucy Say

The goal was to provide access to primary materials through the World Wide Web by the end of the two-year project. The project also provides funding for conservation of the scanned materials in their original formats. The project integrates digital images and texts into the Library's online system and, at the same time, preserves in digital format rare, often unique, physical artifacts. The materials, once digitized and cataloged in MARC format, are accessible through the Library's Web-based OPAC (Online Public Access Catalog).

The work of the project was done in-house by staff hired with IMLS grant funds, under the general direction of the Library Director and the daily supervision of the Information Services Librarian, in consultation with the Manuscripts/Archives Librarian and the Cataloger. The Academy selected and acquired the equipment and software required for creating digitized images and texts, by establishing in 1998 the Albert M. Greenfield Digital Imaging Center for Collections, which is managed by the Library. IMLS funds shared the costs of conserving the original materials and of adding the imaging software module to the Library's integrated online system to manage and provide intellectual access to the materials and their images. The conservation work has been carried out by a local conservation center.

Objectives

The stated objectives will explain some of the decisions we made, including the use of Adobe Acrobat for text files, and the use of an OPAC (Online Public Access Catalog) for Web delivery.

The Image Capture Workstation

Phase One Workstation

The workstation consists of the Phase One PowerPhase scanning back on a Hasselblad medium format camera and a copy stand made by TTI in New Jersey. The room is painted gray and has no windows. When scanning, the overhead lights are turned off, since the ICC Color Profile for the camera is based on the quartz halogen lights only. Leaving the overhead lights on during a scan adversely affects color accuracy. The overhead light fixtures are set to 5000K, for daylight conditions. 5000K is a neutral light that does not give a yellow or blue cast. This helps when examining images for quality control.

Note the foam cradle on the bottom right corner of the photograph above. Such cradles are used to support folio-sized books.

A Linotype-Hell (now Heidelberg) OPAL ultra flatbed scanner is used for all single-sheet items, most of which are archival materials. The OPAL ultra is a high-end scanner with a resolution of up to 1200 dpi (without interpolation). Color accuracy is very good.

Book cradles

Wooden cradle

Three cradle types are used for different books. Foam triangles are used for folio volumes; an upright cradle provides support for books with stiff bindings; and a wooden cradle, as used by the University of Virginia, provides varying degrees of support for smaller books with flexible bindings.

A book is placed on the copy stand perpendicular to the lights to avoid shadows in the gutter. The book is then scanned page by page in one direction, capturing all odd or even numbered pages. The book is then reversed to capture the opposite pages.

An early version of the Phase One software would not rotate images. It was necessary to place the first set of scans in one folder, and the second set in another folder, then rotate in Photoshop as batch processes.

Procedures and Guidelines

Capturing images

Images are captured on the premise that we can afford to capture each object only "once in a generation", because of resource and cost constraints, so our goal is to "do it right the first time". In addition, technology improves so quickly that what we consider today to be large image files (100MB) may seem primitive in a few years, as processing speeds and bandwidths increase. On the other hand, we must be practical and not capture images so large that they are cumbersome, slow to process, and difficult to store using today's technology. We must balance current resources (staffing and equipment) with technological advances. A case in point is the 700+ MB file size of the Phase One FX at 48-bit color depth. Files of this size cannot be burned to CD, and would take minutes to open, reduce in size, move to a server, convert to another color profile, and so on. Costs would quickly become prohibitive. Images at 600 dpi magnify substantially, and more often than not serve as adequate surrogates for the original.
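The trade-off between resolution, bit depth, and file size can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the project's own tooling: it estimates the uncompressed size of a scan from its dimensions, resolution, and bit depth. The 8.5 x 11 inch page and the 24-bit depth are illustrative assumptions, and the function name is hypothetical.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the uncompressed size of a scan in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: a letter-sized page at 600 dpi in 24-bit color.
size_24 = uncompressed_size_bytes(8.5, 11, 600, 24)
# The same page at 48-bit color is twice as large.
size_48 = uncompressed_size_bytes(8.5, 11, 600, 48)

print(f"24-bit: {size_24 / 1_000_000:.1f} MB")
print(f"48-bit: {size_48 / 1_000_000:.1f} MB")
```

Under these assumptions the 24-bit scan comes to roughly 101 MB, in line with the 100MB figure mentioned above, and doubling the bit depth doubles the storage and processing cost.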

A second reason for capturing images in color and at high resolution is that we cannot be sure of all the purposes our images may serve in the future. We currently use them for Web access, but also for reproduction in high-end publications, posters, brochures, and other products.

File Names and File Paths

No more than 8 characters are used for folder and file names. Letters are all lower case. Letters, numbers, and the underscore are the only characters that can be used. The file extension is required. Following these guidelines allows our images to be viewed on older UNIX or DOS computers. Catering to the lowest common denominator gives us the most flexibility in providing access to as many people as possible.
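These naming rules are simple enough to check automatically. The sketch below is one possible way to do so and is not the project's own tool. The function names are hypothetical, and the three-character limit on the extension is an assumption based on DOS 8.3 naming; the guidelines only say that an extension is required.

```python
import re

# One to eight characters from a-z, 0-9, and underscore, then a required
# extension. The 1-3 character extension limit is an assumption (DOS 8.3).
FILE_NAME_RE = re.compile(r"[a-z0-9_]{1,8}\.[a-z0-9]{1,3}")
FOLDER_NAME_RE = re.compile(r"[a-z0-9_]{1,8}")


def is_valid_file_name(name):
    """Return True if the name follows the file naming guidelines."""
    return FILE_NAME_RE.fullmatch(name) is not None


def is_valid_folder_name(name):
    """Return True if the name follows the folder naming guidelines."""
    return FOLDER_NAME_RE.fullmatch(name) is not None


print(is_valid_file_name("0010003p.tif"))    # True
print(is_valid_file_name("Plate 3.tif"))     # False: upper case and a space
print(is_valid_file_name("0010003p"))        # False: no extension
print(is_valid_file_name("000100003p.tif"))  # False: more than 8 characters
```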

The call number (published materials) or collection number (archival materials) is used as part of the file path, as well as volume numbers for multi-volume works or folder numbers for archival materials. Examples are:

The file name is constructed from two four-character sections. The first four characters are a sequential number with leading zeros, which keeps the images in the order in which they appear in the book or archival collection. The second four characters refer to the page number. If pages are numbered, the actual page number is used. If not, we use the following codes:

File name conventions for non-numbered pages

file name    type of page
2ef          2nd endpaper, front of book
1eb          1st endpaper, back
cf           Cover, front
cb           Cover, back
sp           Spine
tp           Title page
tpv          Title page verso
3u           3rd unnumbered page
3r           Page 3, roman numeral (iii)
3p           3rd plate
3pv          3rd plate verso

File name examples are:

file name    description of page
0010003p     10th consecutive page, 3rd plate
001103pv     11th consecutive page, 3rd plate verso
01230ef3     123rd consecutive page, 3rd back endpaper
0005002r     5th consecutive page, roman numeral 2 (ii)
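The two-part naming scheme can be expressed in a few lines of code. This is a hedged sketch rather than the procedure actually used: the function names are hypothetical, and left-padding the page code with zeros is an assumption inferred from examples such as 0010003p and 001103pv.

```python
def build_file_name(sequence, page_code):
    """Join a sequence number and a page code into an 8-character name."""
    if not 1 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    if not 1 <= len(page_code) <= 4:
        raise ValueError("page code must be one to four characters")
    # Zero-pad both halves to four characters each (padding is assumed).
    return f"{sequence:04d}{page_code.lower().rjust(4, '0')}"


def split_file_name(name):
    """Return (sequence, page_code) with the padding zeros removed."""
    if len(name) != 8:
        raise ValueError("file name must be exactly eight characters")
    return int(name[:4]), name[4:].lstrip("0") or "0"


print(build_file_name(10, "3p"))    # 0010003p
print(build_file_name(11, "3pv"))   # 001103pv
print(split_file_name("0005002r"))  # (5, '2r')
```

Because the sequence number comes first and is zero-padded, an ordinary alphabetical sort of the file names reproduces the physical order of the pages.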

Color Management

The generic ICC Color Profile for the Phase One PowerPhase scanning back is very accurate. We do NOT embed the profile in the image, but burn a copy to each CD of archived image files. A "Read Me" file explains how to apply the ICC Profile once a specific use or product is known. At that point, the ICC Profile for the Phase One scanning back is applied, and the image is then converted to the destination ICC Profile.

To avoid changing the color profile of the images accidentally (until burned to CD), images are opened in Photoshop 5.0 or 5.5 as follows: From the File Menu go to Color Settings. Set "Assumed Profiles" to "None" and "Profile Mismatch Handling" to "Ignore". Make sure the "Embed Profile" checkboxes are unchecked. When these instructions are followed, no color management is applied to the image and it remains unchanged.

Processing for the Web

Published Text

Images are converted to grayscale, cropped, then converted to what Photoshop labels "Bitmap", or bitonal. If the conversion to bitmap results in illegibility, use Levels, AutoLevels, or Threshold to correct the problem. Other tools may also be used, including the Eraser, Magic Wand, and Brushes of various sizes. Other useful tools are the Noise filters, specifically the Despeckle and Dust & Scratches filters. Converting to grayscale can be done as a batch process. Cropping can sometimes be done as a batch if the cropping is even in all images and can be uniformly selected, using Photoshop's Magic Wand tool. No resizing is done on the text images of older books. Anything less than 300 dpi does not always give good results once imported into Adobe Acrobat.

Next, bitonal TIFF images are imported into Adobe Acrobat. Acrobat handles monitor display size, navigation, and print size very well. The use of Acrobat, which is proprietary software, is controversial. However, we chose to use it because it has become a de facto standard; it is familiar to most people who browse the Web; the viewer is free to users; images display well; and print size is handled automatically by the software. In addition, and importantly, processing text images using Acrobat is easy for staff to learn.

Illustrations, Published or Original

All illustrations are converted to color JPEGs. We initially converted black and white lithographs to grayscale but soon realized that there was little difference in file size between grayscale and color images. All color JPEGs are downsized from the original high-resolution TIFF images to 75 dpi. Next, the color profile is converted and the color bars in each image are checked for accuracy. Images are cropped to just inside the page edges. Vertical images are set to 400 pixels in width, and horizontal images to 500 pixels in width. Some older monitors display only 640 pixels across; keeping image widths below that figure means users will not have to scroll across the screen. We also make sure that the output size is less than 8 1/2 inches wide by 11 inches high for the purposes of printing.
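The sizing rules above reduce to a short calculation. The sketch below is illustrative only and does not reproduce the Photoshop workflow; the function names are hypothetical, and treating a square image as horizontal is an assumption.

```python
def web_size(width_px, height_px):
    """Scale to 400 px wide (vertical) or 500 px wide (horizontal)."""
    # A square image is treated as horizontal here (an assumption).
    target_width = 400 if height_px > width_px else 500
    scale = target_width / width_px
    return target_width, round(height_px * scale)


def fits_letter_page(width_px, height_px, dpi=75):
    """Check that the print size is under 8.5 x 11 inches at the given dpi."""
    return width_px / dpi < 8.5 and height_px / dpi < 11


# Assumed master image of 5100 x 6600 pixels (vertical orientation).
w, h = web_size(5100, 6600)
print(w, h)                    # 400 518
print(fits_letter_page(w, h))  # True
```

At 75 dpi a 500-pixel-wide image prints about 6.7 inches wide, so width is rarely the limiting factor; the height check matters mainly for tall, narrow originals.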

If users want to see high-resolution versions of images, they may contact us. By supplying low-resolution images on the Web, we are able to control how our images are used. Digital watermarks have not been applied to any of the images.

Manuscripts

Items from the Archives and Manuscripts collection have been digitized, but not yet processed for the Web.

Providing Web Access

As noted earlier, we are a non-profit library with a small staff and limited computer support. We do not have the resources or expertise to support SGML, SQL, or other computer languages or programs that require specialized skills and time. Data entry must be simple. We decided on linking images to our online catalog, using Millennium Media from Innovative Interfaces, Inc. Millennium Media has advantages in our situation:

Disadvantages of the OPAC solution are:

Optical Character Recognition (OCR)

OCR has not been applied to the text images. The typeface is 150 to 200 years old, with inconsistencies in how the ink was applied: sometimes light, sometimes heavy. Stains and foxing complicate the issue. OCR would have been ineffective, and we did not have the funding to send materials out for re-keying. Access is provided by author, title, subject, and keyword searching at the collection level, rather than full-text.

Administrative Metadata

Administrative metadata is the data that describes the digital object: when it was scanned (date and time); by whom; using which equipment; file size and resolution; and rights and permissions. The Dublin Core is not detailed enough to provide information at this level, and there are no turn-key products yet available that manage this kind of data specific to library materials. We created a FileMaker Pro database to manage this data, but hope at some point to transfer the data to XML or another internationally standardized format. There is currently a great deal of work going on in this arena, and once standards are finalized, vendor products will appear. Several important projects in the area of digital image metadata are:
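To illustrate what such a record might look like once moved out of the database, here is a minimal sketch that holds the fields listed above in a small data structure and writes them out as XML. The field names, element names, and sample values are all hypothetical; they do not reflect the actual FileMaker Pro layout or any published metadata standard.

```python
from dataclasses import dataclass, asdict
import xml.etree.ElementTree as ET


@dataclass
class AdminMetadata:
    file_name: str
    scanned_at: str       # date and time of capture
    scanned_by: str
    equipment: str
    file_size_bytes: int
    resolution_dpi: int
    rights: str


def to_xml(record):
    """Serialize one record as a flat XML element (hypothetical schema)."""
    root = ET.Element("admin_metadata")
    for field, value in asdict(record).items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Every value below is a placeholder, not real project data.
record = AdminMetadata(
    file_name="0010003p.tif",
    scanned_at="2000-01-01T10:30:00",
    scanned_by="operator initials",
    equipment="Phase One PowerPhase",
    file_size_bytes=100_980_000,
    resolution_dpi=600,
    rights="placeholder rights statement",
)
print(to_xml(record))
```

Keeping each field as a separate, named element should make a later mapping to whichever standard emerges more straightforward than exporting free-text notes.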

Lessons Learned

Allow for more time than calculations predict. Unexpected events will slow the project.

Older books are difficult, especially if stained or foxed. The digital images require extra processing, using tools such as the Levels, AutoLevels, and Threshold functions in Adobe Photoshop. Once an image is changed to bitonal, stains and foxing can interfere with the legibility of the image if not removed initially.

In this example, a foxed page (left) is converted to bitonal (right) with no modification. The black and white image is difficult to read.

foxed image unprocessed

Below, AutoLevels are applied (left), which evens out the spread of highlights and shadows. Now, when converted to bitonal (right), the text is much more readable. Further refinements can be made for even greater legibility.
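The effect described above can be imitated on a handful of gray values. The sketch below is a simplification, not Photoshop's algorithm: it applies a plain linear stretch of the tonal range (AutoLevels also clips a small share of the extremes and works per channel) followed by a fixed threshold. The pixel values, function names, and the threshold of 128 are assumptions chosen for illustration.

```python
def stretch_levels(pixels):
    """Rescale gray values so the darkest becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        return list(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def to_bitonal(pixels, threshold=128):
    """Map gray values to 0 (black) or 255 (white) at a fixed threshold."""
    return [255 if p >= threshold else 0 for p in pixels]


# Assumed gray values for one row of a dark, foxed page:
# ink near 60-70, paper near 110-120, a stain near 100.
row = [60, 70, 110, 120, 100, 115]

print(to_bitonal(row))                  # [0, 0, 0, 0, 0, 0]
print(to_bitonal(stretch_levels(row)))  # [0, 0, 255, 255, 255, 255]
```

Without the stretch, every value falls below the threshold and the row turns solid black; after it, ink and paper separate cleanly. Real pages usually need the further refinements mentioned above.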

foxed image with photoshop autolevels applied

Each book is unique and presents unique challenges. Each must be evaluated and processed differently. Factors include page size; plate size; illustrations in color or black and white; amount of detail in an illustration; font size; type inconsistencies; foxing; type showing through a page; tight or loose book bindings; gutter size; margin size; handwritten notes in margins; author's signature; and so on.

We started off with the original version of Innovative's image management software, then switched to the beta version of the new product mid-stream. Changes in workflow and processing procedures, the switch in image format from TIFF to JPEG, and bugs in the software all slowed the process.

Currently, our image metadata is stored in a FileMaker Pro database. Once metadata standards for digital images have been finalized and become stable, we plan to transfer the data from the current database.

Quality control takes time.

All steps take longer when file sizes are large. Burning to CD, processing for the Web, and moving images to another computer or server all take longer.

Originals scan better than transparencies, resulting in more accurate color and clarity of detail.

Equipment

Digital Image Acquisition Workstation for Scanning and Copy Work

Digital Image Acquisition and Image Processing Workstation

Dedicated Server, Image Archiving, and Central Storage
