What is Natural Language Processing (NLP)?
Natural Language Processing (NLP) is a field of data science and artificial intelligence that studies how computers and human languages interact. The goal of NLP is to program a computer to understand human language as it is naturally spoken and written.
Current approaches to NLP are based on machine learning — i.e. examining patterns in natural language data, and using these patterns to improve a computer program’s language comprehension. Chatbots, smartphone personal assistants, search engines, banking applications, translation software, and many other business applications use natural language processing techniques to parse and understand human speech and written text.
Behind all of these applications, whether they support savvy marketing, exceptional customer service, or strategic upselling, is one key component: big data integration.
Common use cases for natural language processing
One common application of NLP is a chatbot. If a user opens an online business chat to troubleshoot or ask a question, a computer responds in a manner that mimics a human. Sometimes the user doesn’t even know he or she is chatting with an algorithm.
That chatbot is trained using thousands of conversation logs, i.e. big data. A language processing layer in the computer system accesses a knowledge base (source content) and data storage (interaction history and NLP analytics) to come up with an answer. Big data and the integration of big data with machine learning allow developers to create and train a chatbot.
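As a rough illustration of the lookup step described above, the sketch below stands in for the language processing layer with simple word overlap instead of a trained model. The `KNOWLEDGE_BASE` entries, the `answer` function, and the `history` list are hypothetical names chosen for this example; a production chatbot would learn from conversation logs rather than match words.

```python
import re

# Hypothetical knowledge base: known questions mapped to canned answers.
KNOWLEDGE_BASE = {
    "how do i reset my password": "Use the 'Forgot password' link on the sign-in page.",
    "what are your opening hours": "Support is available 9am to 5pm, Monday to Friday.",
    "how do i cancel my order": "Open 'My orders', select the order, and choose 'Cancel'.",
}

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def answer(question, history):
    """Return the answer whose stored question shares the most words with the input."""
    words = tokenize(question)
    best_question, best_score = None, 0
    for known in KNOWLEDGE_BASE:
        score = len(words & tokenize(known))
        if score > best_score:
            best_question, best_score = known, score
    if best_question is None:
        reply = "Sorry, I don't know that yet."
    else:
        reply = KNOWLEDGE_BASE[best_question]
    history.append((question, reply))  # interaction history (the "data storage")
    return reply

history = []
print(answer("I forgot my password, how can I reset it?", history))
```

Word overlap is only a placeholder for the matching step; the point is the flow from user input, to knowledge base, to a stored interaction history.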
Natural language processing can also be used to process free-form text and analyze the sentiment of a large group of social media users, such as Twitter followers, to determine whether the target group’s response is negative, positive, or neutral. The process is known as “sentiment analysis” and can give brands and organizations a broad view of how a target audience responded to an ad, product, news story, etc.
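A minimal sketch of the idea, assuming a tiny hand-written word list in place of a trained sentiment model: each post is labeled by counting positive and negative words, and the labels are then tallied across the group. The `POSITIVE` and `NEGATIVE` sets and both function names are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumptions, not a real sentiment lexicon).
POSITIVE = {"love", "great", "good", "amazing", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "broken"}

def classify(post):
    """Label one post as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", post.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def summarize_sentiment(posts):
    """Count how many posts fall into each sentiment class."""
    return Counter(classify(p) for p in posts)

posts = [
    "I love the new ad, great work!",
    "The product arrived broken. Terrible.",
    "Saw the announcement this morning.",
]
print(summarize_sentiment(posts))
```

Word counting misses negation and sarcasm ("not great"), which is one reason real systems learn sentiment from labeled data.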
Other applications for NLP include:
- Extracting individuals’ names or company names from textual resources.
- Grouping forum discussions together by topics.
- Finding discussions in which people are mentioned but do not participate.
- Linking entities.
Natural language processing and Big Data
Natural language processing is built on big data, but the technology brings new capabilities and efficiencies to big data as well.
A simple example is log analysis and log mining. One common NLP technique is lexical analysis, the process of identifying and analyzing the structure of words and phrases. In computer science, it is better known as tokenization (a first step before parsing), and it is used to convert a stream of raw log data into a uniform structure.
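To make the log example concrete, here is a small sketch that tokenizes raw log lines into a uniform record structure. The log layout (date, time, level, component, message) and the names `LOG_PATTERN` and `tokenize_log_line` are assumptions made for illustration; real logs vary widely in format.

```python
import re

# Assumed log layout: "<date> <time> <LEVEL> <component>: <message>".
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>[\w.]+): "
    r"(?P<message>.*)"
)

def tokenize_log_line(line):
    """Split one raw log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

raw_logs = [
    "2023-04-01 12:00:05 ERROR db.pool: connection timed out",
    "2023-04-01 12:00:07 INFO web.server: request served in 35 ms",
    "not a log line",
]

# Keep only the lines that fit the assumed layout.
records = [r for r in map(tokenize_log_line, raw_logs) if r is not None]
for record in records:
    print(record["level"], record["component"], record["message"])
```

Once every line has the same fields, the records can be filtered, counted, or loaded into a table like any other structured data.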
A more nuanced example is the increasing capabilities of natural language processing to glean business intelligence from terabytes of data. Traditionally, it is the job of a small team of experts at an organization to collect, aggregate, and analyze data in order to extract meaningful business insights. But those individuals need to know where to find the data they need, which keywords to use, etc. NLP is increasingly able to recognize patterns and make meaningful connections in data on its own.
How natural language processing works
Natural language processing deals with phonology (the study of the system of relationships among sounds in language) and morphology (the study of word forms and their relationships), and works by breaking down language into its component pieces.
The first step in NLP is to convert text into data using text analytics, which occurs at three levels:
- Syntax — What are the grammatical components of the given text?
- Semantics — What is the meaning of the given text?
- Pragmatics — What is the purpose of the text?
The next stage involves using NLP and natural language understanding (NLU) to analyze the structure and meaning of the data. A few approaches to NLP analysis are:
- Distributional Approach — Uses statistical machine learning techniques to identify the meaning of a word from how it is used. Examples include part-of-speech tagging (is this a noun or a verb?) and semantic relatedness (different words that are used in similar ways).
- Frame-Based Approach — Uses a canonical representation of sentences, stored in a data structure called a frame, to identify parts of sentences that are syntactically different but semantically the same.
- Interactive Learning Approach — Uses dynamic, interactive environments where the user teaches the machine how to learn a language, step-by-step.
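The distributional approach can be sketched with nothing more than word counts. In the example below, each word is represented by the words that appear near it, and two words are compared with cosine similarity. The toy corpus, the window size of 2, and the function names are assumptions for illustration; practical systems use far larger corpora and learned embeddings.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]
vectors = cooccurrence_vectors(corpus)

# "cat" and "dog" appear in similar contexts, so they score as more related
# than "cat" and "milk".
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

This is the "different words that are used in similar ways" idea in miniature: similarity comes from shared context, not from any built-in knowledge of what the words mean.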
Four techniques used in NLP analysis
There are many approaches to natural language analysis — some very complex. Four fundamental, commonly used techniques in NLP analysis are:
- Lexical Analysis — Lexical analysis groups streams of characters or sounds from the input into basic units of meaning, called tokens. Later stages of processing then work with these tokens, much as a compiler works with the tokens of source code, so that an application such as a chatbot can respond to a question.
- Syntactic Analysis — Syntactic analysis is the process of analyzing words in a sentence for grammar, using a parsing algorithm, then arranging the words in a way that shows the relationship among them. Parsing algorithms break the sentence down into smaller parts (strings of natural language symbols), then analyze these strings of symbols to determine if they conform to a set of established grammatical rules.
- Semantic Analysis — Semantic analysis involves obtaining the meaning of a sentence, called the logical form, from possible parses of the syntax stage. It involves understanding the relationship between words, such as semantic relatedness — i.e. when different words are used in similar ways.
- Pragmatic Analysis — Pragmatic analysis is the process of discovering the meaning of a sentence based on context. It attempts to understand the ways humans produce and comprehend meaning from text or human speech. Pragmatic analysis in NLP would be the task of teaching a computer to understand the meaning of a sentence in different real-life situations.
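The first two techniques can be illustrated with a toy example: a tokenizer for the lexical step, followed by a parser that checks tokens against a very small grammar (S -> NP VP, NP -> Det N, VP -> V [NP]). The `LEXICON`, the grammar, and the function names are all hypothetical and cover only a handful of words; semantic and pragmatic analysis are beyond what a sketch this size can show.

```python
import re

# Hypothetical mini-lexicon mapping words to parts of speech.
LEXICON = {
    "the": "Det", "a": "Det",
    "user": "N", "chatbot": "N", "question": "N",
    "asks": "V", "answers": "V", "sleeps": "V",
}

def lex(sentence):
    """Lexical analysis: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def parse(sentence):
    """Syntactic analysis for the toy grammar S -> NP VP, NP -> Det N, VP -> V [NP].

    Returns a nested tuple describing the structure, or None if the
    sentence does not conform to the grammar.
    """
    tokens = lex(sentence)
    tags = [LEXICON.get(t) for t in tokens]

    def noun_phrase(i):
        # NP -> Det N: a determiner followed by a noun.
        if tags[i:i + 2] == ["Det", "N"]:
            return ("NP", tokens[i], tokens[i + 1]), i + 2
        return None, i

    subject, i = noun_phrase(0)
    if subject is None or i >= len(tokens) or tags[i] != "V":
        return None
    verb = tokens[i]
    obj, i = noun_phrase(i + 1)  # the object noun phrase is optional
    if i != len(tokens):
        return None  # leftover tokens the grammar cannot account for
    verb_phrase = ("VP", verb) if obj is None else ("VP", verb, obj)
    return ("S", subject, verb_phrase)

print(parse("The user asks a question"))
print(parse("Asks the user"))
```

The nested tuple plays the role of the structure "that shows the relationship among" the words, and a `None` result means the input did not conform to the grammar rules.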
NLP and the future of Big Data
Software applications using NLP and AI are expected to be a $5.4 billion market by 2025. The possibilities for both big data and the industries it powers are almost endless.
- The healthcare industry will continue to benefit dramatically from better natural language processing and data integration. Communications will be more efficient, errors will be reduced, and diagnoses will improve when healthcare software can understand and integrate hastily written or typed doctors’ notes, patients’ phone calls and voicemails, and more. In addition, pharmaceutical companies and medical researchers often hold large volumes of text data in clinical trial information, patient notes, etc. Analyzing that information will be much more efficient with tools like automatic summarization.
- Law enforcement will benefit from a system that can understand and integrate language-turned-data from social media posts, criminal records, and anonymous phone calls and tips.
- Legal firms will benefit when pages and pages of legal documents, stenographer notes, testimonies, and/or police reports can be translated to data and easily summarized.
Specific NLP processes like automatic summarization — analyzing a large volume of text data and producing an executive summary — will be a boon to many industries, including some that may not have been considered “big data industries” until now.
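One simple way to approximate automatic summarization is extractive: score each sentence by how frequent its words are in the whole text, then keep the top-scoring sentences. The sketch below takes that approach, assuming a short hand-written stop-word list and a naive sentence splitter; the `summarize` function and its parameters are illustrative names, and modern summarizers generally rely on trained language models instead.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was",
              "for", "on", "with", "that", "it"}

def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive split: a sentence ends at ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(frequency[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

text = (
    "The trial enrolled patients with asthma. "
    "Patients in the trial received the new inhaler. "
    "The weather was mild. "
    "Patients reported fewer asthma symptoms."
)
print(summarize(text, max_sentences=1))
```

Sentences built from words that recur across the document tend to carry its main topic, while off-topic sentences score low and drop out.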
And big data processes will, themselves, continue to benefit from improved NLP capabilities. So many data processes are about translating information from humans (language) to computers (data) for processing, and then translating it from computers (data) to humans (language) for analysis and decision making. As natural language processing continues to become more and more savvy, our big data capabilities can only become more and more sophisticated.
Getting started with NLP and Talend
NLP uses various analyses (lexical, syntactic, semantic, and pragmatic) to make it possible for computers to read, hear, and analyze language-based data. As a result, technologies such as chatbots are able to mimic human speech, and search engines are able to deliver more accurate results to users’ queries.
Talend Studio with machine learning on Spark can be used to teach a computer to understand how humans use natural language. To get started, download Talend Open Studio for Big Data.