The Gutenberg project makes over 33,000 previously published books available for free as e-books. This is done with the help of thousands of volunteers, through a project called Distributed Proofreaders. The contributions made by these volunteers empower readers to enjoy these books on Apple’s iPad, Kindle, Android, and similar platforms.
With OCR or even manual typing there will be several errors in the text produced. Human proofreading becomes a necessary activity before the book is converted into a downloadable e-book. Similar to translation, only a real person can spot and correct the errors. I often notice at least a few typos in newly published books. I wonder why authors don’t employ crowd-sourcing to get their chapters proofread. The ability to read the content early is reward enough for volunteers.
Old magazine articles, comics and famous letters from India can be made available with the power of distributed or crowd-powered proofreading. It’s unfortunate that there are no digitized old books available in Indian languages on Gutenberg.
Using Dubzer (free crowd-sourced proofreading), Lipikaar (easy Unicode-based typing for Indian languages), Pothi (self-publishing, print on demand, downloadable e-books), and other such web-based platforms, we can create a digital library of timeless Indian content whose copyright has expired and which can therefore be publicly distributed. Even semi-urban or rural folks who read well in their local language but have poor access to libraries will be able to make reading an enjoyable leisure activity. With India’s 3G-powered smartphone revolution, is this hard to imagine? We can initially aim to create 100 e-book titles in each Indian language, including English.
The possibilities are exciting and challenging. These ideas came up as a result of our conversations with Abhaya Agarwal, co-founder of Pothi.com, who has a keen interest in the work published by Indian authors/journalists who did not have the benefit of digitization.
We would love to jump-start this initiative with a group of like-minded folks. Do write to us if you have any of these: insights or leads to such attempts, OCR expertise, relevant open-source OCR software, timeless books/articles/magazines/literature, typed text, etc. Even if you don’t have these, please join in with your ideas and enthusiasm. Students are welcome too!
Update (February 1, 2011)
The recent data from Lipikaar shows that we have gathered users across the spectrum.
No one language accounts for more than 20% of users. A year ago we had Hindi and Punjabi dominating our charts.
The top 10 languages used by Lipikaar users are – Hindi (19%), Arabic (17%), Punjabi (13%), Marathi (10%), Gujarati (8%), Telugu, Malayalam, Bengali, and Tamil. Urdu and Kannada are tied at the 10th spot.
On the applications front, users have typed in the above languages in 300 different Windows applications! The most popular one, Microsoft Word, accounts for only 3%!
The top applications are Microsoft Word, Excel, Access, Internet Explorer, Acrobat, Firefox, Chrome, Outlook, Notepad, PowerPoint, Google Talk, Yahoo Messenger, Photoshop, and so on.
Some of the new entrants that are being actively used with Lipikaar are Google Earth and iTunes.
After powering the PC and websites, we’re gearing up to power the mobile phone with Indian languages. Do send us your ideas. Write to me if you would like to include Lipikaar with your software or mobile application.
Alok Kejriwal tore into the ‘gLocal’ strategy, citing how Orkut and Facebook have eaten into BigAdda’s share. There is zero advantage gained from geographic positioning when marketing your Web application online.
Alok was at his best, pulling up factual Web Analytics data from Games2Win to show what the head and the tail are like on the Web.