Sunday 8 November 2009

ePubs and quality

You may have heard news about the release of "bookserver" by the good folks at the Internet Archive. This is a DRM-free ePub ecosystem, initially stocked with the prodigious output of Google's book scanning project and the Internet Archive's own book scanning project.

To see how the NZETC stacked up against the much larger (and better funded) collection I picked one of our Maori Language dictionaries. Our Maori and Pacifica dictionaries month-after-month make up the bulk of our top five must used resources, so they're in-demand resources. They're also an appropriate choice because when they were encoded by the NZETC into TEI, the decision was made not to use full dictionary encoding, but a cheaper/easier tradeoff which didn't capture the linguistic semantics of the underlying entries, but treated them as typeset text. I was interested in how well this tradeoff was wearing.

I did my comparison using the new firefox ePub plugin, things will be slightly different if you're reading these ePubs on an iPhone or Kindle.

The ePub I looked at was A Dictionary of the Maori Language by Herbert W. Williams. The NZETC has the 1957 sixth edition. There are two versions of the work on bookserver. A 1852 second edition scanned by Google books (original at the New York Public library) and a 1871 third edition scanned by the Internet Archive in association with Microsoft (original in the University of California library system). All the processing of both works appear to be been done in the U.S. The original print used macrons (NZETC), acutes (Google) and breves (Internet Archive) to mark long vowels. Find them here.


Lets take a look at some entries from each, starting at 'kapukapu':


NZETC:

kapukapu. 1. n. Sole of the foot.

2. Apparently a synonym for kaunoti, the firestick which was kept steady with the foot. Tena ka riro, i runga i nga hanga a Taikomako, i te kapukapu, i te kaunoti (M. 351).

3. v.i. Curl (as a wave). Ka kapukapu mai te ngaru.

4. Gush.

5. Gleam, glisten. Katahi ki te huka o Huiarau, kapukapu ana tera.

Kapua, n. 1. Cloud, bank of clouds. E tutakitaki ana nga kapua o te rangi, kei runga te Mangoroa e kopae pu ana (P.).

2. A flinty stone. = kapuarangi.

3. Polyprion oxygeneios, a fish. = hapuku.

4. An edible species of fungus.

5. Part of the titi pattern of tattooing.

Kapuarangi, n. A variety of matā, or cutting stone, of inferior quality. = kapua, 2.

Kāpuhi, kāpuhipuhi, n. Cluster of branches at the top of a tree.

Kāpui, v.t. 1. Gather up in a bunch. Ka kapuitia nga rau o te kiekie, ka herea.

2. Lace up or draw in the mouth of a bag.

3. Earth up crops, or cover up embers with ashes to keep them alight.

kāpuipui, v.t. Gather up litter, etc.

Kāpuka, n. Griselinia littoralis, a tree. = papauma.

Kapukiore, n. Coprosma australis, a shrub. = kanono.

Kāpuku = kōpuku, n. Gunwale.



Google Books:

Kapukapu, s. Sole of the foot,

Eldpukdpu, v. To curl* as a

wave.

Ka kapukapu mai te ngaru; The wave curls over.

Kapunga, v. To take up with both hands held together,

Kapungatia he kai i te omu; Take up food from the oven.

(B. C,

Kapura, s. Fire, -' Tahuna he kapura ; Kindle a fire.

Kapurangi, s. Rubbish; weeds,

Kara, s. An old man,

Tena korua ko kara ? How are you and the old man ?

Kara, s> Basaltic stone.

He kara te kamaka nei; This stone is kara.

Karaha, s. A calabash. ♦Kardhi, *. Glass,



Internet Archive:

kapukapu, n. sole of the foot.

kapukapu, v. i. 1. curl (as a wave). Ka kapukapu mai te ngaru. 2. gush.

kakapii, small basket for cooked food.

Kapua, n. cloud; hank of clouds,

Kapunga, n. palm of the hand.

kapunga, \. t. take up in both hands together.

Kapiira, n. fire.

Kapiiranga, n. handful.

kapuranga, v. t. take up by hand-fuls. Kapurangatia nga otaota na e ia. v. i. dawn. Ka kapuranga te ata.

Kapur&ngi, n. rubbish; uveds.

I. K&r&, n. old man. Tena korua ko kara.

II. K&r&, n. secret plan; conspiracy. Kei te whakatakoto kara mo Te Horo kia patua.

k&k&r&, D. scent; smell.

k&k&r&, a. savoury; odoriferous.

k^ar&, n. a shell-iish.


Unlike the other two, the NZETC version has accents, bold and italics in the right place. It' the only one with a workable and useful table of contents. It is also edition which has been extensively revised and expanded. Google's second edition has many character errors, while the Internet Archive's third edition has many 'á' mis-recognised as '&.' The Google and Internet Achive versions are also available as PDFs, but of course, without fancy tables of contents these PDFs are pretty challenging to navigate and because they're built from page images, they're huge.

It's tempting to say that the NZETC version is better than either of the others, and from a naïve point of it is, but it's more accurate to say that it's different. It's a digitised version of a book revised more than a hundred years after the 1852 second edition scanned by Google books. People who're interested in the history of the language are likely to pick the 1852 edition over the 1957 edition nine times out of ten.

Technical work is currently underway to enable third parties like the Internet Archive's bookserver to more easily redistribute our ePubs. For some semi-arcane reasons it's linked to upcoming new search functionality.

What LibraryThing metadata can the NZETC reasonable stuff inside it's CC'd epubs?

This is the second blog following on from an excellent talk about librarything by LibraryThing's Tim given the VUW in Wellington after his trip to LIANZA.

The NZETC publishes all of it's works as epubs (a file format primarily aimed at mobile devices), which are literally processed crawls of it's website bundled with some metadata. For some of the NZETC works (such as Erewhon and The Life of Captain James Cook), LibraryThing has a lot more metadata than the NZETC, becuase many LibraryThing users have the works and have entered metadata for them. Bundling as much metadata into the epubs makes sense, because these are commonly designed for offline use---call-back hooks are unlikely to be avaliable.

So what kinds of data am I interested in?
1) Traditional bibliographic metadata. Both LT and NZETC have this down really well.
2) Images. LT has many many cover images, NZETC has images of plates from inside many works too.
3) Unique identification (ISBNs, ISSNs, work ids, etc). LT does very well at this, NZETC very poorly
4) Genre and style information. LT has tags to do fancy statistical analysis on, and does. NZETC has full text to do fancy statistical analysis on, but doesn't.
5) Intra-document links. LT has work as the smallest unit. NZETC reproduces original document tables of contents and indexes, cross references and annotations.
6) Inter-document links. LT has none. NZETC captures both 'mentions' and 'cites' relationships between documents.

While most current-generation ebook readers, of course, can do nothing with most of this metadata, but I'm looking forward to the day when we have full-fledged OpenURL resolvers which can do interesting things, primarily picking the best copy (most local / highest quality / most appropiate format / cheapest) of a work to display to a user; and browsing works by genre (LibraryThing does genre very well, via tags).