
A worn dictionary is usually a good sign – an indicator that it is getting plenty of use. But with the recent revolution in dictionary development and design, it may also mean that your copy is out of date.
In this age of rapidly changing technology, replacing an outmoded computer, upgrading a piece of software or even buying a new car are routine practices. Perhaps it’s time to add your aging English-language dictionary to the list.
The field of lexicography – the art and science of writing dictionaries – has undergone vast changes in the past twenty years. Defining words used to be largely a matter of introspection. Today, lexicographers have almost instant access to thousands of examples of words in the context of actual real-life written and spoken communication.
The reason, of course, is the availability of lightning-fast computers, sophisticated software, vast collections of written and spoken texts in digital form and almost infinite amounts of electronic storage space to hold them.
Interestingly, the dictionaries that have benefited the most from the revolution in technology and methodology are those produced for non-native speakers of English. Unlike dictionaries for native speakers, which focus almost entirely on meaning, "learners’ dictionaries", as they are normally called, provide considerable information on how to use words as well.
The revolution began in 1980 at the University of Birmingham. It was there that Professor John Sinclair established the Cobuild project to carry out a computer analysis of a collection of texts – known as a "corpus" (plural: corpora) – totaling eighteen million words. Today that corpus has grown to over 400 million words and plans are afoot to add another 100 million words in the near future.
Cobuild is actually an acronym made up of "Co" for Collins, the publishing company, and "build" for Birmingham University International Language Database. It is the publishing arm of the university’s English language research programme, and its primary products are learners’ dictionaries and reference books.
Two prominent Birmingham University lexicographers who joined the Cobuild project in its early stages were in Thailand recently. Ramesh Krishnamurthy is still with the team and Gwyneth Fox has become the project manager for the division of Macmillan Publishers which produces its learners’ dictionaries. Archarn Terry caught up with them at the Thai Tesol Conference in Chiang Mai.
The Cobuild corpus
Photo: Ramesh Krishnamurthy, Cobuild
Ramesh is actively involved in building Cobuild’s massive and ever-changing corpus of modern English. How modern is it? "I can almost guarantee that all the language in there has been composed since 1990," he says. "And every year we revise it so if we have older data in there we would drop it out and bring in new."
Where do the texts come from? Just about everywhere. "Obviously one of the easier sources nowadays is newspapers," he explains, "so we do have quite a lot of journalism in there. Most of the main broadsheet newspapers from the UK, tabloid newspapers as well. We’ve always tried to maintain a roughly 75%-25% distribution between British and American, as we recognise those two major world writings."
In addition to newspapers, Ramesh says, the corpus includes an extensive collection of texts from general consumer magazines and specialist magazines as well as business publications like the Wall Street Journal and The Economist.
Spoken English is represented as well. "About 20 million words of American radio broadcasting from National Public Broadcasting in Washington, 20 million words from the BBC World Service, lots of local radio stations, plus between 10 and 20 million words of informal conversations, interviews, meetings," he says. "We’ve put tape recorders anywhere where it’s sensible to put them."
Currently, says Ramesh, the Cobuild team is working to give the corpus a more international flavour. "We’ve acquired Australian English; we’ve just acquired 20 million words of Canadian English. We’re negotiating for Indian English and we want to increase the internationalism of the corpus," he says.
Insights
Photo: Gwyneth Fox, Macmillan
With the current technology, new insights into the English language are a regular occurrence, even for a seasoned lexicographer like Gwyneth Fox. As a recent example, she tells a short story about a Polish friend of hers who speaks very good English.
"He wrote to me one day ‘As you know Gwyneth, I’m averse to cigarettes.’ And I read it and I thought, yes, that’s right, he doesn’t like – and then I stopped and thought that sounds funny to me. So I went and looked at the corpus data and discovered that every single example we had was ‘not averse to’."
Her search, she says, only took a matter of seconds. "It’s very very easy. Once you’ve got your corpus, however big it is, all you literally do is type in the word ‘averse’ and press ‘return’. What you get is called a concordance and the word ‘averse’ is in the middle and the context it’s being used in is on either side. So you’ve got your word ‘averse’ down the middle and you look to the left and all you see is ‘not’ or ‘n’t’. So it’s terribly, terribly easy."
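The concordance view Fox describes can be approximated in a few lines of Python. What follows is a minimal sketch, not Cobuild’s software: the function names `concordance` and `left_neighbours` and the three sample sentences are invented for illustration, and a real corpus tool would query indexed text rather than scan a single string.

```python
import re
from collections import Counter


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts at the same column.
        lines.append(f"{left:>{width}} {match.group(0)} {right:<{width}}")
    return lines


def left_neighbours(text, keyword):
    """Count the word that immediately precedes each occurrence of the keyword."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return Counter(
        words[i - 1] for i, word in enumerate(words) if word == keyword and i > 0
    )


# Invented sample sentences for illustration; this is not real corpus data.
sample = (
    "She is not averse to a glass of wine with dinner. "
    "He wasn't averse to taking risks when it mattered. "
    "They were not averse to the idea of moving abroad."
)

for line in concordance(sample, "averse"):
    print(line)

print(left_neighbours(sample, "averse").most_common())
```

Reading down the left side of the printed column is the manual check Fox describes; the count of left-hand neighbours simply automates it. On these invented sentences the only words found before "averse" are "not" and "wasn't".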
Of course, as a native speaker, Fox already knew intuitively how to use ‘averse’. The check of the corpus merely confirmed her suspicions. As a lexicographer, however, she immediately recognised that this is exactly the type of information her Polish friend would need in a dictionary aimed at non-native speakers. Supplying such usage information is indeed one of the important aspects which sets learners’ dictionaries apart from standard dictionaries.
Corpus research has led to another difference as well. Ramesh recalls the very first analysis he did with the Cobuild team back in 1984. He was looking at the word "surge" and he had what he calls "my first shock".
"It was interesting because coming from a more traditional linguistic background, obviously I knew the connection of ‘surge’ with the Latin. I thought OK it’s going to be about tides and rocks. Out of 400 examples, only four were for surging tides and surging waves.
"The most common usage was in journalism, in economic journalism, talking about a surge of imports, a surge in interest rates. So most of it was to do with mathematics, statistics, proportions, and rates of change. Then, roughly in this order it was, I think, surging emotions – a surge of joy, a surge of pleasure, a surge of despair. Then it was things like crowd movements – the crowd surged forward. And then finally, you had these four lines for surging waves."
This, Ramesh says, brought up an immediate problem. How do you order the various meanings in a dictionary? Even today, standard dictionaries tend to begin with the core or original meaning of a word. The Cobuild team quickly realised, however, that in cases like "surge", traditional ordering would not be very helpful to non-native speakers.
"Because the evidence was so overwhelming," says Ramesh, "that only four were for this so-called core meaning, that (meaning) was placed last. Basically, it was a matter of saying, ‘How often is the student going to come across this meaning?’ And if it’s only going to be four out of 400, they’re going to need that last. So you need to give them first surging imports."
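The ordering rule Ramesh describes amounts to sorting a word’s senses by how often each one turns up in the corpus. The sketch below assumes a hypothetical `Sense` record and made-up glosses. Only the figure of four examples out of 400 for the waves-and-tides sense comes from the interview; the other three counts are invented so that the total is 400 and the order matches the one he recalls.

```python
from typing import NamedTuple


class Sense(NamedTuple):
    gloss: str
    corpus_hits: int


# Listed here in a traditional order, with the historical "core" meaning first.
senses = [
    Sense("movement of waves or tides", 4),           # 4 of 400: from the interview
    Sense("sudden rise in a rate or quantity", 250),  # invented count
    Sense("sudden strong emotion", 96),               # invented count
    Sense("forward movement of a crowd", 50),         # invented count
]


def order_by_frequency(entries):
    """Put the most frequently attested sense first."""
    return sorted(entries, key=lambda sense: sense.corpus_hits, reverse=True)


total = sum(sense.corpus_hits for sense in senses)
for rank, sense in enumerate(order_by_frequency(senses), start=1):
    share = sense.corpus_hits / total
    print(f"{rank}. {sense.gloss} ({share:.1%} of examples)")
```

With any counts of this shape the waves-and-tides sense drops to the bottom of the entry, which is where the Cobuild team placed it.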
Idioms are rare
One significant and almost counterintuitive finding from corpus research is the rarity of idioms in both written and spoken English.
"They are much much rarer than you think," says Fox. "At Cobuild at one point we were writing a dictionary of idioms. I was reading the text and I came to the letter ‘r’. I think every learner in the world knows ‘It’s raining cats and dogs’. It wasn’t there and it was because we didn’t have any examples.
"I said to the person who was editing it, You can’t have a dictionary of idioms without ‘it’s raining cats and dogs’ for learners, because every learner in the world knows it. And she said, I can’t put it in. And I said you’ve got to go find some examples. We were lucky, because the Internet had kind of just started really and we could find a few on the Internet. So it crept in like that."
Ramesh readily concurs that idioms are low-frequency items. "These are among the rare items, four times per million words is not uncommon. The place where you will find them heavily used is in newspaper editorials," he says.
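A rate "per million words" is a way of comparing counts across corpora of different sizes. As a rough illustration, the sketch below converts between a raw count and a per-million rate. The helper names are hypothetical, the corpus size is rounded to the 400 million words quoted earlier for Cobuild, and the raw count of 1,600 is simply what a rate of four per million would imply for a corpus of that size, not a reported figure.

```python
def per_million(occurrences: int, corpus_words: int) -> float:
    """Normalise a raw count to occurrences per million words of corpus."""
    return occurrences * 1_000_000 / corpus_words


def expected_occurrences(rate_per_million: float, corpus_words: int) -> float:
    """Estimate how many times an item appears, given its per-million rate."""
    return rate_per_million * corpus_words / 1_000_000


CORPUS_WORDS = 400_000_000  # rounded corpus size quoted in the article

print(expected_occurrences(4, CORPUS_WORDS))  # implied raw count for the idiom
print(per_million(1_600, CORPUS_WORDS))       # back to the per-million rate
```

Seen this way, an idiom at four per million would still yield well over a thousand examples in a corpus of this size, yet a learner reading a few thousand words a day would rarely meet it.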
Bilingual dictionaries
While learners’ dictionaries clearly get the nod over standard dictionaries for non-native speakers, is there still a place for the ever-popular bilingual dictionaries?
Yes, there is, says Fox, particularly if you simply need to know what a word means.
"If you look up a word like ‘daffodil’ (in a learners’ dictionary), it will say something like ‘a yellow flower which flowers in the springtime’. But then, there are other yellow flowers that flower in the springtime. That’s where a bilingual dictionary would be useful because there’s a one-to-one equivalent there. But if there was something interesting about the way ‘daffodil’ was used, then that’s what the learners’ dictionary would say."
A word of caution, however: "The problem with bilingual dictionaries," says Fox, "particularly the small ones, is that they give you a list of different meanings and it’s really difficult for a learner to work out which is the translation they need."
The future
Looking to the future, Fox says she and her team at Macmillan are eager to develop and refine the learners’ corpus they have been working on. "It shows us the words that learners use and it shows us the mistakes that learners typically make," she explains.
This project, together with the continual expansion and analysis of the main corpus, promises to keep her very busy. "There’s more than enough to keep me going for my lifetime," she concludes.
According to Ramesh at Cobuild, now is the time for software designers and lexicographers to work together to take their field to the next level. "I feel we’re coming to the end of the first phase of corpus building," he says.
"Software has developed, but it’s reached a kind of plateau. I think it’s now a time for the programmers to make a real leap in their vision of what kind of analyses are a) possible and b) for people like me, lexicographers, what is useful. There is still a lot that could be done. As the corpus gets bigger, you want the software to be more and more sophisticated."