Overview of Natural Language Processing

Kimberly Surico | 07/06/15 | 8 min read

“NLP is not magic, but the results you can get sometimes seem almost magical.”

Author: Wei Li, Chief Scientist

The lack of standardized technical terminology in constantly evolving technology areas like Natural Language Processing (NLP) can make discussions about this technology challenging for a few reasons:

  • New researchers can be overwhelmed by an ever-growing and confusing list of technical terms
  • Different researchers may have different terms for the same concept
  • Terms can be ambiguous or refer to different things in different people’s minds

Without a clear consensus, the burden falls on the tech community to decode the semantics or referents behind technical terms, in their broad or narrow senses, with the understanding that ambiguity is possible. Seasoned professionals who are sensitive to new terms can, of course, rapidly place them in the existing concept hierarchy of the field.

In that spirit, I’ve created four flowcharts for this presentation, to help you comb through a full inventory of NLP-related terms organized in a concept network. All technical terms mentioned in this series are underlined (and acronyms are shown in italics). There are also some hyperlinks for interested readers to explore further.

What is NLP?

Let’s start our journey with some background on the general concept of Natural Language Processing (NLP) and where it belongs, as well as some of the interchangeable or sister terms of NLP.

As the name suggests, Natural Language Processing refers to the computer processing of natural languages, for whatever purpose and regardless of processing depth. “Natural language” means the languages we use in our daily life, such as English, Russian, Japanese, and Chinese; it is synonymous with human language, distinguished mainly from formal languages, including computer languages.

Natural language is the most natural and common form of human communication, not only in the spoken form, but also in written language, which has grown exponentially in recent years due to mobile device and social media use.

Compared with formal language, natural language is considered a “problem area” thanks to colloquialisms, idioms, sarcasm and the like. It is much more complex, often rife with omissions and ambiguity, making it difficult to process (hence NLP as a highly skilled profession is a “golden rice bowl” for us NLP practitioners, :=)).

Heads or tails – CL and NLP

The term used (almost) interchangeably with NLP is Computational Linguistics (CL). As the name suggests, computational linguistics combines Computer Science (CS) and Linguistics. In fact, NLP and CL are two sides of the same coin: NLP focuses on practice, while CL is the science (theory).

Put another way: CL is the scientific basis for NLP and NLP is CL’s application.

Unlike basic disciplines such as mathematics and physics, these two fields have a short distance between theory and practice, so CL and NLP are often treated as the same thing in many scenarios.

Practitioners can therefore call themselves NLP engineers in industry or computational linguists in academia. But the two titles are not quite interchangeable: although computational linguists in academia also build NLP systems, their focus is on using those experiments to support the study of theory and algorithms.

NLP engineers in industry, on the other hand, are mainly charged with implementing real-life systems or building production-quality software products.

That difference allows NLP engineers to adopt whatever works for the case at hand (following Deng’s famous “cat theory”: black or white, it is a good cat as long as it catches mice), and to be less concerned with how sophisticated or popular a strategy or algorithm is.

Where does Machine Learning fit in?

Another term often used in parallel with NLP is Machine Learning (ML). Strictly speaking, machine learning and NLP are concepts at completely different levels: the former refers to a class of approaches, while the latter names a specific problem area.

However, due to the “panacea” nature of machine learning, coupled with the fact that ML has dominated the mainstream of NLP (especially in academia), many people forget or simply ignore the existence of the other NLP approach, namely hand-crafted linguistic rules.

Thus, it is not a surprise that in these people’s eyes, NLP is machine learning.

In reality, of course, machine learning goes way beyond the field of NLP. The machine learning algorithms used for various language processing tasks can also be used to accomplish other artificial intelligence (AI) tasks, such as stock market analysis and forecasting, credit card fraud detection, machine vision, DNA sequence classification, and even medical diagnosis.

Hand-crafted rules

In parallel to machine learning, the more traditional approach to NLP is hand-crafted rules, formulated and compiled by linguists or knowledge engineers. Sets of such rules for given NLP tasks form a computational grammar, which can be compiled into rule systems.
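To make the idea concrete, here is a minimal sketch of what one such rule might look like, reduced to a single regular expression in Python (a hypothetical toy, not the formalism of any real computational grammar): the pattern encodes the linguistic observation that “<Company> acquired <Company>” signals an acquisition event.

```python
import re

# One hand-crafted rule, reduced to a regular expression (toy example):
# "<Company> acquired/bought/purchased <Company>" signals an acquisition.
ACQUISITION_RULE = re.compile(
    r"(?P<buyer>[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)"    # capitalized name
    r"\s+(?:acquired|bought|purchased)\s+"            # trigger verbs
    r"(?P<target>[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)"   # capitalized name
)

def apply_rule(sentence: str):
    """Return (buyer, target) if the rule fires, else None."""
    m = ACQUISITION_RULE.search(sentence)
    return (m.group("buyer"), m.group("target")) if m else None

print(apply_rule("Microsoft acquired LinkedIn in 2016."))  # ('Microsoft', 'LinkedIn')
print(apply_rule("The weather was lovely."))               # None
```

A real grammar would layer many such rules, typically operating over linguistic structures rather than raw strings.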

Machine learning and rule systems have their own advantages and disadvantages. Generally speaking, machine learning is excellent at coarse-grained tasks like document classification or clustering, and the hand-crafted rules are good at fine-grained linguistic analysis, such as deep parsing.

More simply, if we compare language to a forest and the sentences of the language to trees, machine learning is an adequate tool for overviewing the forest while the rule system sees each individual tree. In terms of data quality, machine learning is strong in recall (coverage of linguistic phenomena), while rules are generally good at precision (accuracy).
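Since precision and recall anchor this comparison, here is a toy calculation with entirely made-up numbers, showing how a precision-oriented rule system and a recall-oriented ML system can score differently on the same task:

```python
# Toy precision/recall arithmetic with made-up numbers. Suppose a corpus
# contains 100 true instances of some linguistic phenomenon.
gold = 100
systems = {
    "rules (fine-grained)": (60, 57),  # (instances returned, of which correct)
    "ml (coarse-grained)":  (95, 76),
}

for name, (found, correct) in systems.items():
    precision = correct / found   # accuracy of what the system returned
    recall = correct / gold       # coverage of what was actually there
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
# rules (fine-grained): precision=0.95, recall=0.57
# ml (coarse-grained): precision=0.80, recall=0.76
```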

Machine learning and rules complement each other fairly naturally, but unfortunately, there are some “fundamentalist extremists” from both schools who are not willing to recognize the other’s strengths and try to belittle or eliminate the other approach.

Due to the complexity of natural language phenomena, a practical real-life NLP system often needs to balance or configure between precision and recall and between coarse-grained analysis and fine-grained analysis. Thus, combining the two NLP approaches is often a wise strategy.

A simple and effective method of combining them is to build a backup model for each major task: grammars are compiled first and run for detailed analysis with high precision (at the cost of modest recall), and machine learning is then applied as a default subsystem to recover the recall.
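Here is a minimal sketch of that backup arrangement, with both components stubbed out (the names and logic are hypothetical, purely for illustration):

```python
from typing import Optional

def grammar_pass(sentence: str) -> Optional[str]:
    """High-precision rule pass; stays silent (None) when no rule fires."""
    if " acquired " in sentence:        # stand-in for a real grammar rule
        return "acquisition_event"
    return None

def ml_fallback(sentence: str) -> str:
    """Higher-recall statistical classifier (stubbed with a constant)."""
    return "other_event"                # a trained model would run here

def analyze(sentence: str) -> str:
    """Grammar first for precision; ML as the default to recover recall."""
    return grammar_pass(sentence) or ml_fallback(sentence)

print(analyze("Microsoft acquired LinkedIn."))  # acquisition_event
print(analyze("The market rallied today."))     # other_event
```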

Keep in mind that both approaches face the resource issue known as the knowledge bottleneck: grammars require skilled labor (linguists) to write and test, while ML requires huge amounts of labeled data; in particular, the relatively mature method of supervised learning presupposes sizable (human-)labeled data, i.e., an annotated corpus.

The AI/ML relationship

It is worth mentioning that traditional AI also relies heavily on a manually-coded rule system, but there is a fundamental difference between the AI rule system and the grammar-based system for linguistic analysis (parsing). Generally speaking, computational grammars are much more tractable and practical than the AI rule system.

An AI rule system not only needs linguistic analysis done by a sub-system such as a computational grammar; it also attempts to encode common sense (at least the core of accumulated human common sense) in its knowledge representation and reasoning, making AI much more sophisticated, yet often too cumbersome, for practical applications.

In a sense, the relationship between ML and AI resembles the relationship between NLP and CL: ML focuses on the application side of AI, and one would assume that AI should then sit in a theoretical position to guide ML.

In reality, though, that is not the case at all: (traditional) AI is heavily based on knowledge encoding and representation (knowledge engineering) and logical reasoning, with overhead often too big and too complicated to scale up, or too expensive to maintain, in real-life intelligent systems.

Building intelligent systems (NLP included) has gradually become the domain of machine learning, whose theoretical foundation rests on statistics and information theory rather than logic. AI scientists such as Cyc inventor Douglas Lenat are rare today, as statisticians dominate the space.

Perhaps in the future, there will be a revival of the true/pure AI, but in the foreseeable future, machine learning, which models human intelligence as a black box connecting the observable input and output, clearly prevails.

Note that the difference between an impractical (or too ambitious) AI knowledge engineering approach and the much more tractable NLP computational grammar approach determines their different outcomes: while (traditional) AI is practically superseded by ML in almost all intelligent systems, the computational grammar approach proves to be viable and will continue to play a role in NLP for a long time (albeit facing constant prejudice as well as challenges from ML and statisticians).

Understanding vs. processing

There is also another technical term almost interchangeable with NLP, namely, Natural Language Understanding (NLU).

Although the literal interpretation of NLU as a process for machines to understand natural languages may sound like science fiction with a strong AI flavor, in reality the use of NLP vs. NLU, just like the use of NLP vs. CL, is often just a different habit adopted in different circles. NLP and NLU are basically the same concept.

By “basically,” I mean that NLP can refer to shallow or even trivial language processing (for example, shallow parsing, including tasks like tokenization, which splits a sentence into words, and morphological processing such as stemming, which splits a word into stem and affix), whereas NLU by definition assumes deep analysis (deep parsing). Here is the thing: viewed through an AI lens, it is NLU; from the ML perspective, it should only be called NLP.
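To illustrate just how shallow such processing can be, here is a toy tokenizer and a naive suffix-stripping stemmer (real systems use proper tokenizers and stemmers, such as the Porter stemmer; this is only a sketch):

```python
import re

def tokenize(sentence: str) -> list:
    """Toy tokenizer: split a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def stem(word: str) -> str:
    """Naive suffix stripping: a drastically simplified stand-in for a
    real stemmer such as the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("The parsers tokenized the sentences.")
print(tokens)  # ['The', 'parsers', 'tokenized', 'the', 'sentences', '.']
print([stem(t.lower()) for t in tokens])
```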

In addition, natural language technology, or simply language technology, is another common way of referring to NLP.

A broader view of NLP

Since the NLP equivalent CL has two parents, computer science and linguistics, it follows that NLP has two parents too: in fact, NLP can be seen as an application area of both computer science and linguistics.

In the beginning, the general concept of Applied Linguistics was assumed to cover NLP. But given the well-established and distinguished status of computational linguistics as an independent discipline for decades now (with Computational Linguistics as the flagship journal, the ACL as the community, and the annual ACL conference and COLING among the top research meetings), Applied Linguistics now refers mainly to language teaching and practical areas such as translation, and it is generally no longer considered a parent of NLP or computational linguistics.

Conceptually, NLP (as a problem area) and ML (as methodology) both belong to the same big category of artificial intelligence, especially when we use the NLP-equivalent term natural language understanding or when specific applications of NLP, such as machine translation, are involved.

However, as mentioned above, the traditional AI, which emphasizes knowledge processing (including common-sense reasoning), is very different from the data-driven reality of the current ML and NLP systems.

Related hierarchy of NLP

Now that we have made clear where NLP belongs and identified the synonyms and sister terms of NLP, let us “grasp the key links” and look at NLP per se, sorting out the hierarchy of related concepts and terms.

The four flowcharts below present NLP at four levels, using a conceptual system architecture (to be repeated in each subsequent chapter). The four levels are:

  • (i) linguistic level
  • (ii) extraction level
  • (iii) mining level
  • (iv) app level

These four levels of (sub-)systems basically represent a bottom-up support relationship: 1 ==> 2 ==> 3 ==> 4. Clearly, the core engine of NLP (namely, a parser) sits in the first layer as an enabling technology, and the app level in the fourth layer includes applications like question answering, machine translation, and intelligent assistants such as Siri.
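As a rough illustration of that bottom-up support relationship, here is a conceptual sketch in Python with every level stubbed out (the function names and data shapes are invented for illustration, not an actual architecture):

```python
# Conceptual sketch of the four-level stack (1 ==> 2 ==> 3 ==> 4); every
# body is a placeholder, and each level consumes the output of the one below.

def linguistic_level(text: str) -> list:
    """Level 1: the core engine (parser) turns text into parse structures."""
    return [{"sentence": s.strip(), "parse": "..."}
            for s in text.split(".") if s.strip()]

def extraction_level(parses: list) -> list:
    """Level 2: pull entities, relations, and events out of the parses."""
    return [{"source": p["sentence"], "facts": []} for p in parses]

def mining_level(extractions: list) -> dict:
    """Level 3: aggregate extracted facts into trends and statistics."""
    return {"units": len(extractions), "trends": []}

def app_level(mined: dict) -> str:
    """Level 4: apps (QA, MT, assistants like Siri) built on mined data."""
    return f"insights over {mined['units']} units"

print(app_level(mining_level(extraction_level(linguistic_level(
    "NLP is not magic. The results can seem magical.")))))  # insights over 2 units
```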

As a final note, since natural language takes two forms, speech (oral form) and text (written form), NLP naturally covers two important branches in speech processing: (i) speech recognition designed to enable computers to understand human speech; (ii) speech synthesis to teach computers to speak back to humans.

As I am no expert in speech applications, this series will deal only with text-oriented NLP, treating speech recognition as a pre-processor and speech synthesis as a post-processor of our core theme, text processing. In fact, this division of labor is valid even in the actual language systems we see today.

For example, popular NLP applications on smartphones, such as the iPhone’s Siri, use speech recognition to first convert speech to text, which is then fed to the subsequent system for text analysis and understanding.
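In code, that division of labor amounts to wrapping the text-oriented core between a speech front end and back end; here is a purely illustrative sketch with all three components stubbed:

```python
# Purely illustrative stubs: speech recognition as pre-processor and speech
# synthesis as post-processor around a text-oriented NLP core.

def speech_recognition(audio: bytes) -> str:
    """Pre-processor: convert spoken input to text (stub)."""
    return "what is the weather today"

def text_nlp(text: str) -> str:
    """The text analysis and understanding core this series covers (stub)."""
    return f"answer to: {text!r}"

def speech_synthesis(text: str) -> bytes:
    """Post-processor: convert the textual answer back to speech (stub)."""
    return text.encode("utf-8")   # placeholder where audio would be generated

def assistant(audio: bytes) -> bytes:
    """Siri-style loop: speech in, text NLP in the middle, speech out."""
    return speech_synthesis(text_nlp(speech_recognition(audio)))

print(assistant(b"..."))
```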

We show this in the four flowcharts below, and will present them one by one in detail in subsequent chapters of this series.

[Four flowcharts: the linguistic, extraction, mining, and app levels of NLP]

This blog first appeared here.

About the Author: Dr. Wei Li has been Chief Scientist at NetBase since 2006, where he leads the development of information extraction and sentiment analysis technology to power technology search as well as social media analysis of brands.
