English is the most common human language in North America as well as in many other world regions. As languages go, it is rich in content, with roughly 500,000 everyday words and at least that many technical terms, dwarfing the German vocabulary of about 185,000 words and the French vocabulary of fewer than 100,000. English is also transmitted to more than 100 million people every day by the five largest broadcasting companies (CBS, NBC, ABC, BBC, and CBC). According to Top 10 countries with the most English language speakers, the current Top 10 English-speaking countries are as listed below.
Each human language is characterized by its own alphabet, words, phonemes, and grammar (here meaning an attribute or BNF grammar, a narrower concept than the everyday sense of the word). Intermediate categories such as nouns, clauses, and phrases form a hierarchy ranging from the sounds of a language to the meaning of sound combinations.
The words of the English language are written using a set of 26 letters from a Latin-based alphabet in use in England since the 7th century AD.
The sentence "The quick brown fox jumps over the lazy dog" contains all of the letters in the English alphabet.
An Old English alphabet of 24 letters, in use since about 1000 AD, gave rise to the present alphabet. The letters of the modern alphabet and their relative frequencies in English text are shown below.
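Relative letter frequencies of this kind can be estimated from any sample of text. The sketch below (plain Python, using the pangram above as a toy corpus; a real estimate would require a much larger sample) counts letters with `collections.Counter`:

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each alphabet letter appearing in `text`."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

# The pangram uses every letter at least once, so all 26 appear.
freqs = letter_frequencies("The quick brown fox jumps over the lazy dog")
```

In a large English corpus, E typically tops such a table at around 12-13 percent.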
This frequency-of-letters table is the basis for many encoding schemes, for both encryption and compression. Morse code, for example, assigns the shortest symbols to the most frequent letters: E is a single dot and T a single dash. A general method for determining the optimal encoding for an arbitrary set of symbols and associated probabilities is Huffman coding. In the reverse case, if a simple one-to-one cipher is used to encrypt letters, the most common ciphertext letter is apt to represent an E, the second most common a T, and so on; many algorithms exist to hide these relationships.
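As an illustration, a minimal Huffman coder over a handful of letters can be built with a priority queue; the probabilities below are invented for the example, merely echoing the fact that E and T are frequent:

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code: a dict mapping each symbol to a bit string."""
    tiebreak = count()  # keeps heap comparisons away from unorderable dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least probable subtrees.
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Invented example probabilities; E, the most probable, gets the shortest code.
codes = huffman_code({"E": 0.40, "T": 0.25, "A": 0.20, "Q": 0.15})
```

The result is a prefix-free code: no symbol's bit string is a prefix of another's, so an encoded stream can be decoded unambiguously.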
Over 300,000 words occur in the English language, but only some 50,000 can be considered common. These words are built from a much smaller inventory of basic sound units called phonemes. A given phoneme is pronounced in slightly different ways depending on the succeeding and, less frequently, the preceding phonemes (consider the P in PUT and PAT); these phonetic variants are called allophones. Allophones are difficult to analyze because they are very numerous, quite language-dependent, and highly context-sensitive.
Speech features are often displayed in a spectrogram, which plots sound frequency against time and depicts amplitude as a third dimension: the darkness of the display. In the case of vowels, dark horizontal bands correspond to dominant frequencies, or formants; the band having the lowest frequency is called the first formant (F1), the next higher the second formant (F2), and so on. Though individual vowels overlap considerably in F1 and F2, one of the milestone studies in speech recognition, by G. E. Peterson and H. L. Barney ("Control methods used in a study of the vowels", Journal of the Acoustical Society of America 24:175-184, 1952), demonstrated how most vowels could be visually separated on a plot of F1 against F2. The table below summarizes the results of that study for adult males.
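The idea behind that separation can be sketched as a nearest-neighbor lookup in F1-F2 space. The values below are rounded adult-male averages commonly quoted from the Peterson-Barney study for four of the vowels, not the full table:

```python
# Approximate adult-male formant averages (Hz), rounded illustrative values.
VOWELS = {
    "IY (heed)": (270, 2290),
    "AE (had)": (660, 1720),
    "AA (hod)": (730, 1090),
    "UW (who'd)": (300, 870),
}

def classify_vowel(f1, f2):
    """Nearest-neighbor guess at a vowel from its first two formants."""
    return min(VOWELS,
               key=lambda v: (VOWELS[v][0] - f1) ** 2 + (VOWELS[v][1] - f2) ** 2)

print(classify_vowel(280, 2200))  # a high front vowel lands nearest IY
```

Real vowel classification is harder than this sketch suggests, precisely because of the F1-F2 overlap between speakers noted above.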
Vowels are made with the vocal cords vibrating and air exiting the mouth. The vocal cords vibrate at a fundamental frequency and multiples thereof (harmonics), as many as 40 of which may occur. The fundamental frequency is about 120 Hz for the average man, about 225 Hz for the average woman, and about 300 Hz for the average child.
Diphthongs are vowels that change their sound as they are uttered; the A in MADE is an example and is represented with the code EY. These changes occur because of movements of the lips and tongue.
A language grammar is the set of rules that defines the properly formed words, phrases, and sentences of a language. To accomplish this, the words of a language are generally placed into a small number of classes. Eight major word classes are defined for English: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and determiner. These classes are then grouped into legal phrases and clauses; a noun phrase might consist of a determiner + adjective + noun, as in "the red fox." Finally, sentences are defined in terms of clauses; one sentence type might be defined as a noun phrase + a verb phrase. For example, "The fox was white" is an independent clause that is also a simple sentence. It can, however, be extended with additional elements to form a more complex sentence: "The fox was white because of its parent."
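The word-class idea can be made concrete with a toy lexicon and one phrase pattern. Everything here, the lexicon and the single pattern, is an illustrative assumption, not a real English grammar:

```python
# A toy lexicon mapping words to their word classes (assumed for illustration).
WORD_CLASS = {
    "the": "determiner", "a": "determiner",
    "red": "adjective", "lazy": "adjective",
    "fox": "noun", "dog": "noun",
}

def is_noun_phrase(words):
    """True if the word sequence matches determiner + adjective + noun."""
    classes = [WORD_CLASS.get(w) for w in words]
    return classes == ["determiner", "adjective", "noun"]

print(is_noun_phrase(["the", "red", "fox"]))  # True
print(is_noun_phrase(["fox", "the", "red"]))  # False
```

A full grammar replaces this single hard-coded pattern with a set of rules, which is exactly what the BNF notation described below provides.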
Early grammars of languages that preceded English were also defined by rules. According to History of English Grammars, the first formal definition of English was Pamphlet for Grammar by William Bullokar, published in 1586. It was written in part to demonstrate that English was quite as rule-bound as Latin. However, it was not until the 19th century that modern-language grammars became formally systematized, and by the early 1900s publications were being written for the teaching and study of English, even as a foreign language, including descriptions of the intonation patterns of English (see H. E. Palmer, A Grammar of Spoken English, 1924, in the references).
Language grammars that are processed by computer require a formal rule notation called Backus-Naur Form (BNF), or its more convenient extension, Extended BNF (EBNF). Both computer languages and human languages can be defined this way, as can almost any other rule-based notion. The basic idea is to define a hierarchical tree of rules with "what's being defined" (WBD) at its root:
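For instance, using ":=" for "is defined as" and "|" for alternatives, a single root rule might read (a sketch consistent with the example discussed next):

```ebnf
WBD := L1 | L2 L3 | L2 L4
```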
If L1..L4 represent the respective digits 1..4, the single rule above defines WBD as either 1, 23, or 24: the vertical bar represents alternatives, and adjacent symbols are ordered (concatenated, in this example). Rules can also be written one to a line, and each non-terminal symbol appearing on a left side (WBD, L1, etc.) should itself be defined:
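Written in that style, with each non-terminal given its own production, the same grammar might read:

```ebnf
WBD := L1 | L2 L3 | L2 L4
L1 := '1'
L2 := '2'
L3 := '3'
L4 := '4'
```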
The above grammar includes the same 4 quoted terminal symbols ('1', '2', '3', '4') and defines the same original sentences (1, 23, and 24). Such a set of sentences is called a language, and this one is context-free because the left side of every rule (also called a production) contains only one symbol.
Note that adding the rule L2 := '5' adds two new sentences, 53 and 54, to the language.
A recursive production like L2 := '2' L2 | '2' is legal and, in our example, adds strings beginning with not just one but an indefinite number of '2's (1, 23, 223, 2223, ..., 24, 224, ...), in effect making the language infinitely large.
This set of strings can also be written as (1, 2+3, 2+4), where the "+" means "one-or-more" (an asterisk, *, indicates "zero-or-more"). Infinite size is the usual case for both computer and human languages, and it is why computer processing needs grammars: otherwise, problems like syntax checking, semantic analysis, voice recognition, and translation from one language to another could be accomplished by simple (though possibly very large) table lookup.
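The one-or-more notation corresponds directly to regular expressions, so membership in this tiny (but infinite) language can be checked in one line; a sketch in Python:

```python
import re

# The infinite language {1, 2+3, 2+4}: '1', or one or more '2's then '3' or '4'.
PATTERN = re.compile(r"1|2+[34]")

def in_language(sentence):
    """True if `sentence` belongs to the example language."""
    return PATTERN.fullmatch(sentence) is not None

print(in_language("2223"))  # True
print(in_language("25"))    # False
```

No finite lookup table could do this, since the '2's may repeat without bound; the pattern (equivalently, the grammar) encodes the infinite set finitely.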
The greatest success at computer recognition of English sentences has involved context-free grammars with a small (100-1000 word) vocabulary; examples include a simple English grammar, a poetry grammar, and a grammar for an airline flight reservation system. The principal bottleneck seems to stem from the fact that English is not a totally context-free language: the way we parse and understand sentences depends on context, either past or future, as has been argued mathematically.
One alternative to a strict BNF representation of English is statistical parsing, which associates each grammar rule with a probability that is used to select the proper rule to apply. One such system, OpenNLP, offers several tools including "a sentence splitter, a tokenizer, a part-of-speech tagger, a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks"), a parser, and a name finder."
Once the words and phrases of a speech input stream have been parsed into sentences on a computer, the problem becomes one of Natural Language Processing and falls within the sphere of Artificial Intelligence. Syntax analysis using a compiler or parser determines whether a sentence is correct according to the grammar, and computational semantics attempts to discover its meaning. Today, English text-to-speech (TTS) is a well-established technology, but to be fully useful for applications like an automated online assistant, computers must be capable of bidirectional natural language communication.
The link parser is available online; its output for the sentence "The fox was white because of its parent" illustrates such a parse.
EXTERNAL LINKS & REFERENCES
American English Phonemes
The BNF Web Club
Computational Linguistics (Wiki)
Number of Words in the English Language