
Thursday, August 14, 2008

Web Security Words Digitizing Libraries

When you buy something on a website or open an account with one of the free email services, you are often asked to type a few distorted words into a text box. The check exists to verify that whoever is entering the information or opening the email account is a human and not a spammer's computer.

According to Luis von Ahn, a computer scientist at Carnegie Mellon University in Pittsburgh, "Approximately 200 million of these are typed every day by people around the world. Each time you type one of these, essentially you waste about 10 seconds of your time," he says. "If you multiply that by 200 million, you get that humanity as a whole is wasting around 500,000 hours every day, typing these annoying squiggly characters."
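The arithmetic behind that estimate is easy to check. The short sketch below simply multiplies the two figures von Ahn quotes (200 million words a day, about 10 seconds each) and converts seconds to hours; the variable names are illustrative only.

```python
# Figures quoted by von Ahn in the article.
captchas_per_day = 200_000_000
seconds_per_captcha = 10

# Total time spent per day, converted from seconds to hours.
total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600

print(f"{total_hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

The result, roughly 556,000 hours, is in line with his rounded figure of "around 500,000 hours every day."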

To harness all these man-hours, von Ahn came up with an innovative idea. He knew that many libraries have large efforts under way to digitize their collections. These projects first scan books or newspapers, essentially taking a picture of each page. Then a computer takes the image of each word and converts it into text using optical character-recognition software.

But computers often come across printed words they just can't recognize. "Especially for older documents, things that were written before 1900, where the ink has faded and the pages have yellowed out, the computer makes a lot of mistakes," says von Ahn.

A human being has to look at those words and decipher them. It occurred to von Ahn that he could link this kind of activity to security devices used on the Internet. Instead of asking people to prove they're human by copying random sequences of distorted letters and numbers, he could ask them to decipher mystery words from scanned books and newspapers.

So he got together with The New York Times, which is digitizing newspapers going back to 1851, and a nonprofit called the Internet Archive, which is digitizing thousands of books.

And now, if you go to someplace like Ticketmaster to buy, say, Jimmy Buffett tickets, you'll be shown images of not one but two distorted words.

One of these is the real security word: Type this one correctly and you're in. The other image is something that has mystified the digitizing software.

If people recognize that word, they type it in. This image will actually be shown to several people. If they all agree on what the word is, it will be considered accurately transcribed. And von Ahn says it will be incorporated into the digitized copy of the book or the newspaper that it came from.
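The two-word scheme can be sketched in a few lines of Python. This is only a rough illustration of the idea as the article describes it, not von Ahn's actual system: the function names, the case-insensitive comparison, and the minimum of three matching answers are all assumptions made for the example.

```python
from typing import List, Optional


def passes_security_check(typed: str, known_word: str) -> bool:
    """Return True if the user typed the known security word correctly.

    Ignoring case and surrounding spaces is an assumption for this sketch.
    """
    return typed.strip().lower() == known_word.strip().lower()


def agreed_transcription(answers: List[str], min_answers: int = 3) -> Optional[str]:
    """Return the transcription if enough people answered and all agree.

    min_answers=3 is a hypothetical threshold; the article only says
    the image is shown to "several people".
    """
    normalized = [a.strip().lower() for a in answers]
    if len(normalized) < min_answers:
        return None  # not enough answers collected yet
    first = normalized[0]
    if all(a == first for a in normalized):
        return first  # everyone agrees, so accept the word
    return None  # disagreement, so the word stays unresolved


# Each pair is (what the user typed for the security word,
#               what the user typed for the mystery word).
submissions = [
    ("morning", "upon"),
    ("morning", "Upon"),
    ("mornin", "open"),   # fails the security word, so this answer is ignored
    ("morning", "upon"),
]

answers = [
    mystery for security, mystery in submissions
    if passes_security_check(security, "morning")
]
print(agreed_transcription(answers))  # prints "upon"
```

Only answers from people who got the security word right are counted here, on the assumption that someone who passes that check is a human who also made an honest attempt at the mystery word.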

Read the complete article on NPR
