Digital Library: IMLS Grant
In 1999, the Ewell Sale Stewart Library received a National Leadership Grant from the Institute of Museum and Library Services (IMLS) to digitize collections that document the establishment of a distinctly American approach to natural science during the first half of the nineteenth century and make them available on the web. See American Natural Science in the First Half of the Nineteenth Century to access the images resulting from this project. A description of the project is presented below.
We would very much like to have feedback on this project, including its utility and ease of use; any other comments are also appreciated. Please send comments to the Project Manager.
Project Description
IMLS is an independent federal agency that lends support to the nation's museums and libraries. The $151,863 grant was awarded to the Academy Library to "preserve and enhance access to unique library resources to the broader community; address the challenges of preserving and archiving digital media", and to develop "standards, techniques, or models related to the digitization and management of digital resources."
The Library has digitized printed texts and illustrations, manuscripts, and drawings from its extensive print and archive collections. The original watercolor shown here, drawn by Titian Peale, is an example of items digitized from the Academy's archive collection. Titian Peale, son of Charles Willson Peale, was an avid naturalist and an accomplished artist in his own right. He was responsible for many of the illustrations in early Academy works.
The goal was to provide access to primary materials through the World Wide Web by the end of the two-year project. The project also provides funding for conservation of the scanned materials in their original formats. The project integrates digital images and texts into the Library's online system and, at the same time, preserves in digital format rare, often unique, physical artifacts. The materials, once digitized and cataloged in MARC format, are accessible through the Library's Web-based OPAC (Online Public Access Catalog).
The work of the project was done in-house by staff hired with IMLS grant funds, under the general direction of the Library Director and the daily supervision of the Information Services Librarian, in consultation with the Manuscripts/Archives Librarian and the Cataloger. The Academy selected and acquired the equipment and software required for creating digitized images and texts, by establishing in 1998 the Albert M. Greenfield Digital Imaging Center for Collections, which is managed by the Library. IMLS funds shared the costs of conserving the original materials and of adding the imaging software module to the Library's integrated online system to manage and provide intellectual access to the materials and their images. The conservation work has been carried out by a local conservation center.
Objectives
- Provide access via the Web to a worldwide audience and clientele.
- Provide a model that small, non-profit libraries with limited staff and computer resources may use.
These objectives explain some of the decisions we made, including the use of Adobe Acrobat for text files and the use of an OPAC (Online Public Access Catalog) for Web delivery.
The Image Capture Workstation
The workstation consists of the Phase One PowerPhase scanning back on a Hasselblad medium format camera and a copy stand made by TTI in New Jersey. The room is painted gray and has no windows. When scanning, the overhead lights are turned off, since the ICC Color Profile for the camera is based on the quartz halogen lights only. Keeping the lights on adversely affects color accuracy. The overhead light fixtures are set to 5000K, for daylight conditions. 5000K is a neutral light that does not give a yellow or blue cast. This helps when examining images for quality control.
Note the foam cradle on the bottom right corner of the photograph above. Such cradles are used to support folio-sized books.
A Linotype-Hell (now Heidelberg) OPAL ultra flatbed scanner is used for all single-sheet items, most of which were archival materials. The OPAL ultra is a high-end scanner with a resolution of up to 1200 dpi (without interpolation). Color accuracy is very good.
Book cradles
Three cradle types are used for different books. Foam triangles are used for folio volumes; an upright cradle provides support for books with stiff bindings; and a wooden cradle, as used by the University of Virginia, provides varying degrees of support for smaller books with flexible bindings.
A book is placed on the copy stand perpendicular to the lights to avoid shadows in the gutter. The book is then scanned page by page in one direction, capturing all odd or even numbered pages. The book is then reversed to capture the opposite pages.
An early version of the Phase One software would not rotate images. It was necessary to place the first set of scans in one folder, and the second set in another folder, then rotate in Photoshop as batch processes.
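The two-pass capture described above yields two sets of scans that must be merged back into reading order. The following is a minimal sketch of that merge, not the project's actual procedure. It assumes each pass is held as a list of file names in front-to-back order and that the first pass holds the page that comes first in each pair; the function name and the sample file names are hypothetical.

```python
from itertools import zip_longest


def merge_passes(first_pass, second_pass):
    """Interleave two capture passes into a single reading-order list.

    Assumes both lists are already in front-to-back order and that
    first_pass holds the page that comes first in each pair.
    """
    merged = []
    for a, b in zip_longest(first_pass, second_pass):
        # zip_longest pads the shorter pass with None; skip the padding.
        if a is not None:
            merged.append(a)
        if b is not None:
            merged.append(b)
    return merged


# Hypothetical folders: one per pass, as described for the early software.
odd_pages = ["pass1/p001.tif", "pass1/p003.tif", "pass1/p005.tif"]
even_pages = ["pass2/p002.tif", "pass2/p004.tif"]
reading_order = merge_passes(odd_pages, even_pages)
```

If the second pass were captured back to front, that list would need to be reversed before merging.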
Procedures and Guidelines
Capturing images
Images are captured on the premise that, because of resource and cost constraints, we can afford to capture each object only "once in a generation", so our goal is to "do it right the first time". In addition, technology improves so quickly that what we consider today to be large image files (100MB) may seem primitive in a few years, as processing speeds and bandwidths increase. On the other hand, we must be practical and not capture images so large that they are cumbersome, slow to process, and difficult to store using today's technology. We must balance current resources (staffing and equipment) with technological advances. A case in point is the 700+ MB file size of the Phase One FX, at 48-bit color depth. Files of this size cannot be burned to CD, and would take minutes to open, reduce in size, move to a server, convert to another color profile, and so on. Costs would quickly become prohibitive. Images at 600 dpi magnify substantially, and more often than not serve as adequate surrogates for the original.
A second reason for capturing images in color and at high resolution is that we are not sure of all the purposes our images may fulfill in the future. We currently use them for Web access, but also for reproduction in high-end publications, for posters, brochures, and other products.
- All pages of each book are captured, including endpapers, blank pages, and plate versos. Spines and covers are also captured. For archive materials, both sides of each sheet are scanned.
- Kodak color and gray scales are used in all digital captures, whether the original item is in black and white or color.
- Text is captured at 300 dpi, and in color. This assumes that intellectual content is more important than graphic representation when it comes to text. [400 dpi is often recommended. However, the Phase One software interpolates at 400 dpi, which is undesirable as interpolation often results in misinformation.]
- Plates and illustrations are captured at high resolution (600 dpi).
- All handwritten manuscripts are scanned at 600 dpi.
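The trade-off between resolution and file size can be estimated with simple arithmetic. The sketch below computes the uncompressed size of a capture from its dimensions, resolution, and bit depth. The 8 x 10 inch page and the 24-bit color depth are illustrative assumptions, not figures from the project, and the function name is hypothetical.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Estimate uncompressed image size: pixel count times bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


MB = 1_000_000  # decimal megabytes

# Hypothetical 8 x 10 inch page captured in 24-bit color.
text_page = uncompressed_size_bytes(8, 10, 300)   # 300 dpi text capture
plate_page = uncompressed_size_bytes(8, 10, 600)  # 600 dpi plate capture

print(f"300 dpi: {text_page / MB:.1f} MB")
print(f"600 dpi: {plate_page / MB:.1f} MB")
```

Doubling the resolution quadruples the pixel count, so under these assumptions the 600 dpi capture is four times the size of the 300 dpi one.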
File Names and File Paths
No more than 8 characters are used for folder and file names. Letters are all lower case. Letters, numbers, and the underscore are the only characters that can be used. The file extension is required. Following these guidelines allows our images to be viewed on older UNIX or DOS computers. Catering to the lowest common denominator gives us the most flexibility in providing access to as many people as possible.
The call number (published materials) or collection number (archival materials) is used as part of the file path, as well as volume numbers for multi-volume works or folder numbers for archival materials. Examples are:
- ql45u6/v2/filename.tif
- qh439/u6t2/filename.tif
- 439/2/filename.tif
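The naming guidelines above lend themselves to an automated check. The following is a minimal sketch, assuming forward slashes as path separators and a DOS-style extension of one to three lowercase letters or digits (the guidelines only say that an extension is required). The function and pattern names are hypothetical.

```python
import re

# Up to 8 characters drawn from lowercase letters, digits, and underscore.
NAME = re.compile(r"[a-z0-9_]{1,8}")
# A file name is a valid name plus a required extension
# (assumption: one to three lowercase letters or digits).
FILE = re.compile(r"[a-z0-9_]{1,8}\.[a-z0-9]{1,3}")


def is_valid_path(path):
    """Return True if every folder and the file name follow the guidelines."""
    *folders, filename = path.split("/")
    return (all(NAME.fullmatch(f) for f in folders)
            and FILE.fullmatch(filename) is not None)


print(is_valid_path("ql45u6/v2/filename.tif"))      # follows the guidelines
print(is_valid_path("ql45u6/v2/longfilename.tif"))  # name exceeds 8 characters
```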
The file name is constructed with two four-character sections. The first four characters are sequential digits with leading zeros, which keep the images in the order in which they appear in the book or archival collection. The second four characters refer to the page number. If pages are numbered, the actual page number is used. If not, we use the following codes:
| file name | type of page |
|---|---|
| 2ef | 2nd endpaper, front of book |
| 1eb | 1st endpaper, back |
| cf | Cover, front |
| cb | Cover, back |
| sp | Spine |
| tp | Title page |
| tpv | Title page verso |
| 3u | 3rd unnumbered page |
| 3r | Page 3, roman numeral (iii) |
| 3p | 3rd plate |
| 3pv | 3rd plate verso |
File name examples are:
| file name | description of page |
|---|---|
| 0010003p | 10th consecutive page, 3rd plate |
| 001103pv | 11th consecutive page, 3rd plate verso |
| 012303eb | 123rd consecutive page, 3rd back endpaper |
| 0005002r | 5th consecutive page, roman numeral 2 (ii) |
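The two-part naming scheme can be expressed as a small helper. This is a sketch under stated assumptions rather than the project's own tooling: the function name is hypothetical, and padding every page code with leading zeros (including letter-only codes such as tp) is an assumption extrapolated from the examples above.

```python
def build_file_name(sequence, page_code):
    """Join a zero-padded sequence number and a zero-padded page code."""
    page_code = str(page_code).lower()
    if not 1 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    if len(page_code) > 4:
        raise ValueError("page code must fit in four characters")
    # str.zfill pads with leading zeros, e.g. "3p" -> "003p".
    return f"{sequence:04d}{page_code.zfill(4)}"


print(build_file_name(10, "3p"))   # 10th image, 3rd plate
print(build_file_name(11, "3pv"))  # 11th image, 3rd plate verso
print(build_file_name(42, 57))     # 42nd image, numbered page 57
```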
Color Management
The generic ICC Color Profile for the Phase One PowerPhase scanning back is very accurate. We do NOT embed the profile in the image, but burn a copy to each CD of archived image files. A "Read Me" file explains how to apply the ICC Profile once a specific use or product is known. At that point, the ICC Profile for the Phase One scanning back is applied, and the image is then converted to the destination ICC Profile.
To avoid changing the color profile of the images accidentally (until burned to CD), images are opened in Photoshop 5.0 or 5.5 as follows: From the File Menu go to Color Settings. Set "Assumed Profiles" to "None" and "Profile Mismatch Handling" to "Ignore". Make sure the "Embed Profile" checkboxes are unchecked. When these instructions are followed, no color management is applied to the image and it remains unchanged.
Processing for the Web
Published Text
Images are converted to grayscale, cropped, then converted to what Photoshop labels "Bitmap", or bitonal. If the conversion to bitmap results in illegibility, use Levels, AutoLevels, or Threshold to correct the problem. Other tools may also be used, including the Eraser, Magic Wand, and Brushes of various sizes. Other useful tools are the Noise filters, specifically the Despeckle and Dust & Scratches filters. Converting to grayscale can be done as a batch process. Cropping can sometimes be done as a batch if the cropping is even in all images and can be uniformly selected, using Photoshop's Magic Wand tool. No resizing is done on the text images of older books. Anything less than 300 dpi does not always give good results once imported into Adobe Acrobat.
Next, bitonal TIFF images are imported into Adobe Acrobat. Acrobat handles monitor display size, navigation, and print size very well. The use of Acrobat, which is proprietary software, is controversial. However, we chose to use it because it has become a de facto standard; it is familiar to most people who browse the Web; the viewer is free to users; images display well; and print size is handled automatically by the software. In addition, and importantly, processing text images using Acrobat is easy for staff to learn.
Illustrations, Published or Original
All illustrations are converted to color JPEGs. We initially converted black and white lithographs to grayscale but soon realized that there was little difference in file size between grayscale and color images. All color JPEGs are downsized from the original high-resolution TIFF images to 75 dpi. Next, color profile conversion is executed and the color bars in each image are checked for accuracy. Images are cropped to just inside the page edges. Vertical images are set to 400 pixels in width, and horizontal images to 500 pixels in width. Some older monitors display at 640 pixels across; with image widths kept below that figure, users will not have to scroll across the screen. We also make sure that the output size is less than 8 1/2 inches wide by 11 inches high for the purposes of printing.
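The sizing rules for Web illustrations can be worked through as arithmetic. The sketch below computes the target pixel dimensions and the resulting print size at 75 dpi. The function names are hypothetical, the 4800 x 6000 pixel capture is an illustrative value, and treating square images as horizontal is an assumption.

```python
def web_dimensions(width_px, height_px, dpi=75):
    """Scale an image to the Web width for its orientation.

    Vertical images become 400 pixels wide and horizontal images
    500 pixels wide, preserving the aspect ratio. Returns the new
    pixel size and the print size in inches at `dpi`.
    """
    # Assumption: square images are treated as horizontal.
    target_width = 400 if height_px > width_px else 500
    target_height = round(height_px * target_width / width_px)
    print_size_in = (target_width / dpi, target_height / dpi)
    return (target_width, target_height), print_size_in


def fits_letter_page(print_size_in):
    """Check the print size against the 8 1/2 x 11 inch limit."""
    width_in, height_in = print_size_in
    return width_in < 8.5 and height_in < 11


# Hypothetical vertical capture of 4800 x 6000 pixels.
size_px, size_in = web_dimensions(4800, 6000)
print(size_px, fits_letter_page(size_in))
```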
If users want to see high-resolution versions of images, they may contact us. By supplying low-resolution images on the Web, we are able to control how our images are used. Digital watermarks have not been applied to any of the images.
Manuscripts
Items from the Archives and Manuscripts collection have been digitized, but not yet processed for the Web.
Providing Web Access
As noted earlier, we are a non-profit library with a small staff and limited computer support. We do not have the resources or expertise to support SGML, SQL, or other computer languages or programs that require specialized skills and time. Data entry must be simple. We decided on linking images to our online catalog, using Millennium Media from Innovative Interfaces, Inc. Millennium Media has advantages in our situation:
- Data entry is very easy;
- No programming skills are required;
- The vendor handles upgrades and improvements;
- The Web interface is user friendly;
- Searches can be done using the OPAC interface, allowing the user to search by author, title, title keyword, or caption keywords.
Disadvantages of the OPAC solution are:
- The images are not searchable by Web spiders, so they are not found through Web search engines;
- The system does not utilize XML;
- The system does not manage Administrative Metadata (see below);
- Full-text searching of PDF files requires an additional Innovative module, at extra cost.
Optical Character Recognition (OCR)
OCR has not been applied to the text images. The typeface is 150 to 200 years old, with inconsistencies in how the ink was applied: sometimes light, sometimes heavy. Stains and foxing complicate the issue. OCR would have been ineffective, and we did not have the funding to send materials out for re-keying. Access is provided by author, title, subject, and keyword searching at the collection level, rather than full-text.
Administrative Metadata
Administrative metadata is the data which describes the digital object: when it was scanned (date and time); by whom; using which equipment; file size and resolution; and rights and permissions. The Dublin Core is not detailed enough to provide information at this level, and there are no turn-key products yet available that manage this kind of data specific to library materials. We created a FileMaker Pro database to manage this data, but hope at some point to transfer the data to an XML or other international standardized format. There is currently a great deal of work going on in this arena, and once standards are finalized vendor products will appear. Several important projects in the area of digital image metadata are:
- Metadata Encoding & Transmission Standard (METS)
- Metadata Object Description Schema (MODS)
- Technical Metadata for Digital Still Images (NISO Z39.87)
- Open Archival Information System (OAIS)
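To make the idea of an administrative metadata record concrete, here is a minimal sketch of one record and a simple XML export. It is not the project's FileMaker Pro schema and does not follow METS, MODS, or any other standard: the class, the field names, the XML element names, and all sample values are hypothetical placeholders.

```python
from dataclasses import dataclass, asdict
import xml.etree.ElementTree as ET


@dataclass
class AdminMetadata:
    """Administrative metadata for one image (hypothetical field names)."""
    file_path: str
    scanned_at: str        # date and time of capture
    operator: str          # who scanned the item
    equipment: str         # which device was used
    file_size_bytes: int
    resolution_dpi: int
    rights: str            # rights and permissions statement


def to_xml(record):
    """Serialize a record to a simple, non-standard XML element."""
    root = ET.Element("image")
    for name, value in asdict(record).items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
record = AdminMetadata(
    file_path="ql45u6/v2/0010003p.tif",
    scanned_at="2000-01-01T10:00:00",
    operator="Example Operator",
    equipment="Phase One PowerPhase",
    file_size_bytes=86_400_000,
    resolution_dpi=600,
    rights="Example rights statement",
)
xml_text = to_xml(record)
print(xml_text)
```

Keeping the record in a plain structure like this would make a later move to a standardized format a matter of renaming and regrouping fields.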
Lessons Learned
Allow for more time than calculations predict. Unexpected events will slow the project.
Older books are difficult, especially if stained or foxed. The digital images require extra processing, using such tools as Levels, AutoLevels, and Threshold functions in Adobe Photoshop. Once an image is changed to bitonal, stains and foxing can interfere with the legibility of the image if not removed initially.
In this example, a foxed page (left) is converted to bitonal (right) with no modification. The black and white image is difficult to read.

Below, AutoLevels are applied (left), which evens out the spread of highlights and shadows. Now, when converted to bitonal (right), the text is much more readable. Further refinements can be made for even greater legibility.
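The effect described here can be illustrated numerically. The sketch below uses a simple linear contrast stretch as a stand-in for AutoLevels (it is not Photoshop's algorithm), followed by a fixed threshold. A short list of grayscale values stands in for a row of pixels; the values, the threshold of 128, and the function names are all illustrative assumptions.

```python
def stretch_levels(pixels):
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def to_bitonal(pixels, threshold=128):
    """Map each grayscale value to black (0) or white (255)."""
    return [255 if p >= threshold else 0 for p in pixels]


# Hypothetical row from a stained page: ink (about 90) on dingy paper (about 120).
row = [120, 118, 90, 92, 121, 119]

without_levels = to_bitonal(row)                 # paper and ink both turn black
with_levels = to_bitonal(stretch_levels(row))    # paper turns white, ink black

print(without_levels)
print(with_levels)
```

Without the stretch, the darkened paper falls below the threshold along with the ink, which is why the unmodified bitonal image is hard to read.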

Each book is unique and presents unique challenges. Each must be evaluated and processed differently. Factors include page size; plate size; illustrations in color or black and white; amount of detail in an illustration; font size; type inconsistencies; foxing; type showing through a page; tight or loose book bindings; gutter size; margin size; handwritten notes in margins; author's signature; and so on.
We started off with the original version of Innovative's image management software, then switched to the beta version of the new product mid-stream. The workflow changed, processing procedures changed, image formats changed from TIFF to JPEG, and the software had bugs, all of which slowed the process.
Currently, our image metadata is stored in a FileMaker Pro database. Once metadata standards for digital images have been finalized and become stable, we plan to transfer the data from the current database.
Quality control takes time.
All steps take longer when file sizes are large. Burning to CD, processing for the Web, and moving images to another computer or server all take longer.
Originals scan better than transparencies, resulting in more accurate color and clarity of detail.
Equipment
Digital Image Acquisition Workstation for Scanning and Copy Work
- Apple PowerMac G3 with 2 x 4GB Hard Drives and Internal 2GB Jaz Drive
- 21" Radius PressView display
- APC Surge Arrest
- Phase One PowerPhase Camera Back
- Hasselblad Medium Format Camera
- 80mm Planar Lens
- 120mm Macro Lens
- 32E and 56E Extension Tubes
- TTI Repro-Graphic Copy Stand
- 2000W Quartz Halogen Lights
- Transmissive Light Source for transparencies and slides
Digital Image Acquisition and Image Processing Workstation
- Apple PowerMac G3 with 2 x 4GB Hard Drives and Internal 2GB Jaz Drive
- 21" Radius PressView display
- APC Surge Arrest
- Linotype-Hell OPAL ultra Flatbed Scanner with LinoColor software
- APS CDR-W Pro CD Burner with Toast software
- External Iomega 100MB Zip Drive
Dedicated Server, Image Archiving, and Central Storage
- Windows NT on Dell PowerEdge 4200
- 3 x 9.1GB LVD SCSI Hard Drives with RAID
- 1.4 KVA/1 DW Uninterruptible Power System (UPS)