Czech linguists take part in digital archiving of Holocaust survivors' testimonies

Two research centres in the Czech Republic are taking part in a pioneering international project - putting together a digital archive of the testimonies of Holocaust survivors, videotaped by the Survivors of the Shoah Visual History Foundation established by the American film director Steven Spielberg.

"The rest that they didn't kill they put on these platforms, on those big lorries, and they were taking them away. I saw my grandmother being taken away on the lorry. My grandmother..."

An elderly woman, who wished to be known only as Kristine, is recalling her terrible memories from a Nazi concentration camp. She is one of the 52,000 people who agreed to record their testimonies on video as part of a project started by the Survivors of the Shoah Visual History Foundation ten years ago. The foundation was established by Steven Spielberg, who recorded a vast number of those testimonies for his 1993 Oscar-winning film Schindler's List.

"They really collected an extremely large collection of testimonies but now the problem is how to work with them. It's not just a collection for its own sake. It's a collection for historians and researchers to put the information together. Now this is the task of this research project: to look at these many different utterances and discourses and to make connections between them and to collect information on a particular date, particular location, particular events."

Professor Eva Hajicova from the Centre for Computational Linguistics at the Faculty of Mathematics and Physics of Charles University in Prague is describing a pioneering project launched by Steven Spielberg's foundation, in which her research centre is taking part.

"They made 52,000 interviews and each interview lasted for 2 to 2.5 hour. So you can imagine that's material of some 100,000 hours. It perhaps does not seem so unimaginable to process the data but if you imagine one or two people sitting and listening to them, they wouldn't be able to index, to sort them, to classify them in dozens of years. So now, the basis of the project we work on is that there must be some more efficient way to process the data."

Digital archiving of the spoken word is an emerging method of capturing the human experience. Professor Hajicova's team is creating a digital catalogue of the recordings, using state-of-the-art technology and modern linguistic methods. Sophisticated speech recognition technology is used to allow historians to find keywords in thousands of hours of video recordings from Holocaust survivors.

"Computers are doing the processing and the programmes for the computers are being developed by people. Our centre is very proud to be one of the five or six participating parties. The initiative came from the United States, so this foundation is one of the participants but they bring the moral and - partially - the financial support. Then there is the NSF programme, the national research programme of the United States and the people who do the work are from the University of Maryland, John Hopkins University in Baltimore, the IBM Research Center and our centre in collaboration with our colleagues from Pilsen, from the West Bohemian University who are mostly concerned with the spoken language."

The catalogue that will be the end result of all this work needs to be accessible for historians, teachers and students. That's why the testimonies need to be indexed properly.

"The first task for us is to formulate and implement on computer programmes which would process the spoken data. Now the task would not be possible or realistic if we were asked to process the data and rewrite it in the form of text. So actually the programme is that we should recognise in the data the key notions, the key ideas, the key locations, key dates - and then it will be possible to search in the material."

In addition to working on an important information technology research project, Eva Hajicova's team is also contributing to an invaluable cultural resource.

"It's very exciting not only from the research point of view but also from the cultural and moral point of view, because the underlying idea of the foundation which was founded by Steven Spielberg was to use the opportunity to capture the life stories of those people who survived the Holocaust. Spielberg was aware that these people are getting very old and that their memories would be lost if they were not collected and registered."

The Czech teams in Prague and in the city of Pilsen are responsible for processing recordings in Czech, Russian, Polish and Slovak. Nearly half of the archive is made up of testimonies given in English. But overall, the accounts of Holocaust survivors and witnesses were given in as many as 32 languages.

"It's not that these languages would be spoken by native speakers. You can imagine that the survivors left their country after the end of the war. For instance the Czechs emigrated to the United States or to Great Britain or somewhere else. Their recordings are in English but it is not native English. But there are also survivors who talk in their native language. So one has to process recordings in a single language but in several varieties. That's one thing. The other thing for Czech speakers is that spoken Czech is different from written Czech. And these people when they told their stories, they were using colloquial language. So it's not just to take the analysis which we have for written texts and transfer it to the spoken language. It is necessary to have these different varieties of the language recorded as well."

Different language varieties are only one part of the challenge. There are also other factors that significantly influence the sound quality of the recordings and make them more difficult to decipher for computers.

"Again, these recordings were made by elderly people. They have different habits. They were very emotional, of course, because these were their bad memories. So the emotions are there. The oldest recordings are unprepared, the people were not reading what they were saying. One thing is speech recognition including multilinguality because there must be a translation somewhere between these key words or key notions, and the other problem is the search in these multilingual sources."

Professor Eva Hajicova says the Czech participants in the project are proud to be working with notable American institutions. She says the Czechs were invited to join them thanks to their long-term experience in the field.

"I think it's because we have been collaborating with these Western universities, the universities in the United States with our other projects. Not just spoken language but also written language and other natural language processing projects which we had before. So that's how we were in contact when this idea of this huge project came about and we were sort of natural partners in this way."

More than 90 percent of the testimonies videotaped by the Survivors of the Shoah Visual History Foundation are from Jewish Holocaust survivors. Yet the archive also contains interviews with other survivors, including Jehovah's Witnesses, Sinti and Roma, homosexuals, political prisoners, and survivors of eugenics policies, as well as rescuers and aid providers, liberators, and participants in war crimes trials.

The name of this revolutionary project to create a digital archive of audio tracks is MALACH. It is an acronym for Multilingual Access to Large Spoken Archives. But "malach" is also the Hebrew word for "angel".