Fw: Distributed Digital Library - Germany

Iris Radulescu (mailto:Iris.Radulescu@LIB.MONASH.EDU.AU)
Fri, 13 Dec 1996 15:08:44 +1100

Message-Id: <199612130415.WAA27116@library.wustl.edu>
Date:         Fri, 13 Dec 1996 15:08:44 +1100
From: Iris Radulescu <mailto:Iris.Radulescu@LIB.MONASH.EDU.AU>
Subject:      Fw: Distributed Digital Library - Germany
To: mailto:IMAGELIB@LISTSERV.ARIZONA.EDU

Hi,
just a note regarding digital document handling aspects:
----------
> From: LOSSAU <mailto:lossau@MAIL.SUB.UNI-GOETTINGEN.DE>
> To: mailto:IMAGELIB@LISTSERV.ARIZONA.EDU
> Subject: Distributed Digital Library - Germany
> Date: Friday, 6 December 1996 20:23
>
> Dear Colleagues,
>
> with beginning of 1997 the German Research Foundation (Deutsche
> Forschungsgemeinschaft) wants to promote the access to research
> materials via a new funding program, the retrospective digitisation of
> library materials. The aim is the establishment of a distributed
> digital library in Germany.
>
> b) First we want to scan text-books. Could you give recommendations
> concerning resolution, data format of image files (e.g. for long-time
> preservation, for distribution over the internet), compression,
> colour-depth, the tagging of image files (TIFF 6.0) (to give some
> informations like "date and time scanned", "artist", "image
> description", "document name", "page number" and so on), the writing
> of (bibliographic) informations in the `comment-field' of the image
> file?
I am no authority in this area, but in our experience a resolution of 200x200 dpi is perfectly adequate for display and printing, and gives reasonably good OCR results. 300x300 dpi is much better for OCR, and not much more expensive in terms of storage and transmission times. We have been using TIFF with CCITT Group 4 compression (multipage) successfully.
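
For a rough feel of the storage side, here is a small sketch of the arithmetic for an uncompressed black-and-white (1 bit per pixel) page. The A4 page size and the function name are my own assumptions for illustration; the Group 4 compressed sizes we actually store depend on page content and are not modelled here.

```python
def raw_bitonal_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size in bytes of a 1-bit-per-pixel scan, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8  # each row is padded up to a byte boundary
    return row_bytes * height_px


# Assumed page size: A4, roughly 8.27 x 11.69 inches.
A4 = (8.27, 11.69)
size_200 = raw_bitonal_bytes(*A4, dpi=200)
size_300 = raw_bitonal_bytes(*A4, dpi=300)
ratio = size_300 / size_200

print(f"200 dpi: {size_200 / 1024:.0f} KiB uncompressed")
print(f"300 dpi: {size_300 / 1024:.0f} KiB uncompressed")
print(f"ratio:   {ratio:.2f}")
```

Before compression the pixel count grows with the square of the resolution, so about (300/200)^2 = 2.25 times; how much of that survives compression has to be measured on real pages.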

For preservation purposes, however, I would think that the higher the resolution, the better the reproduction. That said, microfilm is reportedly still the preferred preservation medium, and microfilm can be digitised, ideally at varying resolutions and as many times as necessary.

Using the TIFF tags for any purpose outside their documented scope in the specification will not work well if the documents are viewed with third-party viewers, which at best ignore those tags. You would depend on your own software to write and interpret them.

> Tests have shown, that OCR with older books (18./19. century)
> will cause problems. Do you have any ideas for dealing with these
> problems?
We have had this problem using Omnipage Pro 6.0 on an 18th-century book. Scanning was done without greyscale and at low resolution. Printing press quirks had made the ink spread unevenly within the characters, not just separating ascenders and descenders, but even leaving "holes" in the main character body. However, I hear from another Australian project that the results were much better when a similar book was scanned directly into Omnipage! This is consistent with the behaviour of the program even on modern prints.
>
> /2/ Administration and structuring of digital documents
Just my contribution on this: I advocate storing the images as files on magnetic or optical storage. The DBMS should hold whatever data is needed about the document, plus the file name or relative path to the file. This is in contrast with another approach, which is to store the images as BLOBs inside the database. Personally, I would be very uneasy doing the latter.
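
To make the idea concrete, here is a minimal sketch using SQLite from Python. It is not our system: the table layout, the column names and the IMAGE_ROOT directory are all hypothetical, chosen only to show the database holding descriptive data plus a relative path while the image itself stays on disk.

```python
import sqlite3
from pathlib import Path

# Hypothetical directory under which all scanned image files live.
IMAGE_ROOT = Path("/data/scans")


def open_catalogue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) document table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS document (
               id         INTEGER PRIMARY KEY,
               title      TEXT NOT NULL,
               pages      INTEGER,
               image_path TEXT NOT NULL UNIQUE  -- relative path, not the image bytes
           )"""
    )
    return conn


def register(conn: sqlite3.Connection, title: str, pages: int, image_path: str) -> int:
    """Record a scanned document; only its relative file path goes into the database."""
    cur = conn.execute(
        "INSERT INTO document (title, pages, image_path) VALUES (?, ?, ?)",
        (title, pages, image_path),
    )
    conn.commit()
    return cur.lastrowid


def locate(conn: sqlite3.Connection, doc_id: int) -> Path:
    """Resolve a document id to the full path of its image file."""
    row = conn.execute(
        "SELECT image_path FROM document WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    return IMAGE_ROOT / row[0]


conn = open_catalogue()
doc_id = register(conn, "Sample reading", 40, "1996/sample.tif")
print(locate(conn, doc_id))
```

Because only the relative path is stored, the image files can be moved to another disk or server by changing IMAGE_ROOT, without touching the database records.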

> /3/ Metadata
> What kind of (bibliographic) metadata would you recommend to tag with
> the digital document in a Document-Management-System? We think on the
> (automatic) import of bibliographic data from our
> online-library-catalogue.
We have done the opposite: in our (mainframe!) catalogue, we created a field which holds the relative path to the image file. A CGI program, which puts a Web interface on top of our mainframe catalogue program, extracts the relative filename and makes it into a URL by prefixing it with a www address. This appears as a hyperlink in the HTML returned as the result of a search. When the link is clicked, the machine where the image is stored delivers the file to the user's viewer. For an easy illustration, for a limited time you can point your browser to http://www-berwick.lib.monash.edu.au and try one of the course codes offered there (normally copyright arrangements do not allow access to scanned documents for anyone other than Monash students).
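
The URL-building step can be sketched roughly as follows. This is an illustration in Python rather than our actual CGI program; the base address, the function name and the sample path are made up for the example.

```python
from html import escape
from urllib.parse import quote, urljoin

# Hypothetical address of the machine that serves the image files.
BASE_URL = "http://images.example.edu/scans/"


def image_link(relative_path: str, label: str) -> str:
    """Turn a relative path from the catalogue record into an HTML hyperlink."""
    # Strip leading slashes so the path is joined under BASE_URL rather than
    # replacing its path, and percent-encode characters such as spaces.
    safe_path = quote(relative_path.lstrip("/"))
    url = urljoin(BASE_URL, safe_path)
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'


link = image_link("course/ABC1010/reading 1.tif", "Week 1 reading")
print(link)
```

Keeping the server address out of the catalogue record means the images can move to a different machine by changing one setting in the CGI program.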

>
> /4/ Distribution and Access
> One way of distribution will be the Internet. Could you give
> recommendations concerning the compression of image files (GIF,
> JPEG?), the possibility of downloading not only pages but also whole
> chapters of a book, the designing of the `user-interface'?
As you can see if you access our own site, what you get is one TIFF file (already compressed) which holds as many as 35-50 pages. To my knowledge, no other format allows such a degree of compression and manageability. There is a downside: the entire document has to be downloaded to the user's machine before it can be viewed. This is expensive in terms of traffic, and it makes it impossible to count how many pages the user viewed or printed, should the copyright holder require this information.

In the future, I will be looking at modifying our TIFF viewer to work in client-server mode and only deliver one page to the user's machine, by extracting it from the multipage TIFF file at the server end.

> Thank you very much for your informations
> Best regards
>
> Norbert Lossau (project officer)
> **************
> Dr. Norbert Lossau
> Niedersaechsische Staats- und Universitaetsbibliothek Goettingen
> Platz der Goettinger Sieben 1
> 37073 Goettingen
> Tel.: +551/39-5217 Fax. +551/39-5222
> E-Mail: mailto:lossau@mail.sub.uni-goettingen.de

Hope my notes will be of some assistance. I have also passed on your very well thought-out statement of requirements to others in the Library who are now specifying a similar system.

All the best

Iris Radulescu
Snr Analyst/Programmer - Systems Development
Monash University Library - Systems Unit