Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Mastering spaCy
Mastering spaCy

Mastering spaCy: An end-to-end practical guide to implementing NLP applications using the Python ecosystem

Arrow left icon
Profile Icon Duygu Altınok
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3 (15 Ratings)
Paperback Jul 2021 356 pages 1st Edition
eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Duygu Altınok
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3 (15 Ratings)
Paperback Jul 2021 356 pages 1st Edition
eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Mastering spaCy

Chapter 1: Getting Started with spaCy

In this chapter, we will have a comprehensive introduction to natural language processing (NLP) application development with Python and spaCy. First, we will see how NLP development goes hand in hand with Python, along with an overview of what spaCy offers as a Python library.

After the warm-up, you will quickly get started with spaCy by downloading the library and loading the models. You will then explore spaCy's popular visualizer displaCy by visualizing several features of spaCy.

By the end of this chapter, you will know what you can achieve with spaCy and how to plan your journey with spaCy code. You will be also settled with your development environment, having already installed all the necessary packages for NLP tasks in the upcoming sections.

We're going to cover the following main topics in this chapter:

  • Overview of spaCy
  • Installing spaCy
  • Installing spaCy's statistical models
  • Visualization with displaCy

Technical requirements

The chapter code can be found at the book's GitHub repository: https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter01

Overview of spaCy

Before getting started with the spaCy code, we will first have an overview of NLP applications in real life, NLP with Python, and NLP with spaCy. In this section, we'll find out the reasons to use Python and spaCy for developing NLP applications. We will first see how Python goes hand-in-hand with text processing, then we'll understand spaCy's place in the Python NLP libraries. Let's start our tour with the close-knit relationship between Python and NLP.

Rise of NLP

Over the past few years, most of the branches of AI created a lot of buzz, including NLP, computer vision, and predictive analytics, among others. But just what is NLP? How can a machine or code solve human language?

NLP is a subfield of AI that analyzes text, speech, and other forms of human-generated language data. Human language is complicated – even a short paragraph contains references to the previous words, pointers to real-world objects, cultural references, and the writer's or speaker's personal experiences. Figure 1.1 shows such an example sentence, which includes a reference to a relative date (recently), phrases that can be resolved only by another person who knows the speaker (regarding the city that the speaker's parents live in) and who has general knowledge about the world (a city is a place where human beings live together):

Figure 1.1 – An example of human language, containing many cognitive and cultural aspects

Figure 1.1 – An example of human language, containing many cognitive and cultural aspects

How do we process such a complicated structure then? We have our weapons too; we model natural language with statistical models, and we process linguistic features to turn the text into a well-structured representation. This book provides all the necessary background and tools for you to extract the meaning out of text. By the end of this book, you will possess statistical and linguistic knowledge to process text by using a great tool – the spaCy library.

Though NLP gained popularity recently, processing human language has been present in our lives via many real-world applications, including search engines, translation services, and recommendation engines.

Search engines such as Google Search, Yahoo Search, and Microsoft Bing are an integral part of our daily lives. We look for homework help, cooking recipes, information about celebrities, the latest episodes of our favorite TV series; all sorts of information that we use in our daily lives. There is even a verb in English (also in many other languages), to google, meaning to look up some information on the Google search engine.

Search engines use advanced NLP techniques including mapping queries into a semantic space, where similar queries are represented by similar vectors. A quick trick is called autocomplete, where query suggestions appear on the search bar when we type the first few letters. Autocomplete looks tricky but indeed the algorithm is a combination of a search tree walk and character-level distance calculation. A past query is represented by a sequence of its characters, where each character corresponds to a node in the search tree. The arcs between the characters are assigned weights according to the popularity of this past query.

Then, when a new query comes, we compare the current query string to past queries by walking on the tree. A fundamental Computer Science (CS) data structure, the tree, is used to represent a list of queries, who would have thought that? Figure 1.2 shows a walk on the character tree:

Figure 1.2 – An autocomplete example

Figure 1.2 – An autocomplete example

This is a simplified explanation; the real algorithms blend several techniques usually. If you want to learn more about this subject, you can read the great articles about the data structures: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata and http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.

Continuing with search engines, search engines also know how to transform unstructured data to structured and linked data. When we type Diana Spencer into the search bar, this is what comes up:

Figure 1.3 – Search results for the query "Diana Spencer"

Figure 1.3 – Search results for the query "Diana Spencer"

How did the search engine link Diana Spencer to her well-known name Princess Diana? This is called entity linking. We link entities that mention the same real-world entity. Entity-linking algorithms concern representing semantic relations and knowledge in general. This area of NLP is called the Semantic Web. You can learn more about this at https://www.cambridgesemantics.com/blog/semantic-university/intro-semantic-web/. I worked as a knowledge engineer at a search engine company at the beginning of my career and really enjoyed it. This is a fascinating subject in NLP.

There is really no limit to what you can develop: search engine algorithms, chatbots, speech recognition applications, and user sentiment recognition applications. NLP problems are challenging yet fascinating. This book's mission is to provide you a toolbox with all the necessary tools. The first step of NLP development is choosing the programming language we will use wisely. In the next section, we will explain why Python is the weapon of choice. Let's move on to the next section to see the string bond of NLP and Python.

NLP with Python

As we remarked before, NLP is a subfield of AI that analyzes text, speech, and other forms of human-generated language data. As an industry professional, my first choice for manipulating text data is Python. In general, there are many benefits to using Python:

  • It is easy to read and looks very similar to pseudocode.
  • It is easy to produce and test code with.
  • It has a high level of abstraction.

Python is a great choice for developing NLP systems because of the following:

  • Simplicity: Python is easy to learn. You can focus on NLP rather than the programming language details.
  • Efficiency: It allows for easier development of quick NLP application prototypes.
  • Popularity: Python is one of the most popular languages. It has huge community support, and installing new libraries with pip is effortless.
  • AI ecosystem presence: A significant number of open source NLP libraries are available in Python. Many machine learning (ML) libraries such as PyTorch, TensorFlow, and Apache Spark also provide Python APIs.
  • Text methods: String and file operations with Python are effortless and straightforward. For example, splitting a sentence at the whitespaces requires only a one-liner, sentenc.split(), which can be quite painful in other languages, such as C++, where you have to deal with stream objects for this task.

When we put all the preceding points together, the following image appears – Python intersects with string processing, the AI ecosystem, and ML libraries to provide us the best NLP development experience:

Figure 1.4 – NLP with Python overview

Figure 1.4 – NLP with Python overview

We will use Python 3.5+ throughout this book. Users who do not already have Python installed can follow the instructions at https://realpython.com/installing-python/. We recommend downloading and using the latest version of Python 3.

In Python 3.x, the default encoding is Unicode, which means that we can use Unicode text without worrying much about the encoding. We won't go into details of encodings here, but you can think of Unicode as an extended set of ASCII, including more characters such as German-alphabet umlauts and the accented characters of the French alphabet. This way we can process German, French, and many more languages other than English.

Reviewing some useful string operations

In Python, the text is represented by strings, objects of the str class. Strings are immutable sequences of characters. Creating a string object is easy – we enclose the text in quotation marks:

 word = 'Hello World'

Now the word variable contains the string Hello World. As we mentioned, strings are sequences of characters, so we can ask for the first item of the sequence:

 print (word [0])
H

Always remember to use parentheses with print, since we are coding in Python 3.x. We can similarly access other indices, as long as the index doesn't go out of bounds:

 word [4]
'o'

How about string length? We can use the len method, just like with list and other sequence types:

 len(word)
11

We can also iterate over the characters of a string with sequence methods:

 for ch in word:
               print(ch)
H
e
l
l
o
W
o
r
l
d

Pro tip

Please mind the indentation throughout the book. Indentation in Python is the way we determine the control blocks and function definitions in general, and we will apply this convention in this book.

Now let's go over the more string methods such as counting characters, finding a substring, and changing letter case.

count counts the number of occurrences of a character in the string, so the output is 3 here:

 word.count('l')
3

Often, you need to find the index of a character for a number of substring operations such as cutting and slicing the string:

 word.index(e)
1

Similarly, we can search for substrings in a string with the find method:

 word.find('World')
6

find returns –1 if the substring is not in the string:

 word.find('Bonjour')
-1

Searching for the last occurrence of a substring is also easy:

 word.rfind('l')
9

We can change letter case by the upper and lower methods:

 word.upper()
'HELLO WORLD'

The upper method changes all characters to uppercase. Similarly, the lower method changes all characters to lowercase:

 word.lower()
'hello world'

The capitalize method capitalizes the first character of the string:

 'hello madam'.capitalize()
'Hello madam'

The title method makes the string title case. Title case literally means to make a title, so each word of the string is capitalized:

 'hello madam'.title()
'Hello Madam'

Forming new strings from other strings can be done in several ways. We can concatenate two strings by adding them:

 'Hello Madam!' + 'Have a nice day.'
'Hello Madam!Have a nice day.'

We can also multiply a string with an integer. The output will be the string concatenated to itself by the number of times specified by the integer:

 'sweet ' * 5
'sweet sweet sweet sweet '

join is a frequently used method; it takes a list of strings and joins them into one string:

 ' '.join (['hello', 'madam'])
'hello madam'

There is a variety of substring methods. Replacing a substring means changing all of its occurrences with another string:

 'hello madam'.replace('hello', 'good morning')
'good morning madam'

Getting a substring by index is called slicing. You can slice a string by specifying the start index and end index. If we want only the second word, we can do the following:

 word = 'Hello Madam Flower'
 word [6:11]
'Madam'

Getting the first word is similar. Leaving the first index blank means the index starts from zero:

 word [:5]
'Hello'

Leaving the second index blank has a special meaning as well – it means the rest of the string:

 word [12:]
'Flower'

We now know some of the Pythonic NLP operations. Now we can dive into more of spaCy.

Getting a high-level overview of the spaCy library

spaCy is an open source Python library for modern NLP. The creators of spaCy describe their work as industrial-strength NLP, and as a contributor I can assure you it is true. spaCy is shipped with pretrained language models and word vectors for 60+ languages.

spaCy is focused on production and shipping code, unlike its more academic predecessors. The most famous and frequently used Python predecessor is NLTK. NLTK's main focus was providing students and researchers an idea of language processing. It never put any claims on efficiency, model accuracy, or being an industrial-strength library. spaCy focused on providing production-ready code from the first day. You can expect models to perform on real-world data, the code to be efficient, and the ability to process a huge amount of text data in a reasonable time. The following table is an efficiency comparison from the spaCy documentation (https://spacy.io/usage/facts-figures#speed-comparison):

Figure 1.5 – A speed comparison of spaCy and other popular NLP frameworks

Figure 1.5 – A speed comparison of spaCy and other popular NLP frameworks

The spaCy code is also maintained in a professional way, with issues sorted by labels and new releases covering as many fixes as possible. You can always raise an issue on the spaCy GitHub repo at https://github.com/explosion/spaCy, report a bug, or ask for help from the community.

Another predecessor is CoreNLP (also known as StanfordNLP). CoreNLP is implemented in Java. Though CoreNLP competes in terms of efficiency, Python won by easy prototyping and spaCy is much more professional as a software package. The code is well maintained, issues are tracked on GitHub, and every issue is marked with some labels (such as bug, feature, new project). Also, the installation of the library code and the models is easy. Together with providing backward compatibility, this makes spaCy a professional software project. Here is a detailed comparison from the spaCy documentation at https://spacy.io/usage/facts-figures#comparison:

Figure 1.6 – A feature comparison of spaCy, NLTK, and CoreNLP

Figure 1.6 – A feature comparison of spaCy, NLTK, and CoreNLP

Throughout this book, we will be using spaCy's latest release v3.1 (the version used at the time of writing this book) for all our computational linguistics and ML purposes. The following are the features in the latest release:

  • Original data preserving tokenization.
  • Statistical sentence segmentation.
  • Named entity recognition.
  • Part-of-Speech (POS) tagging.
  • Dependency parsing.
  • Pretrained word vectors.
  • Easy integration with popular deep learning libraries. spaCy's ML library Thinc provides thin wrappers around PyTorch, TensorFlow, and MXNet. spaCy also provides wrappers for HuggingFace Transformers by spacy-transformers library. We'll see more of the Transformers in Chapter 9, spaCy and Transformers.
  • Industrial-level speed.
  • A built-in visualizer, displaCy.
  • Support for 60+ languages.
  • 46 state-of-the-art statistical models for 16 languages.
  • Space-efficient string data structures.
  • Efficient serialization.
  • Easy model packaging and usage.
  • Large community support.

We had a quick glance around spaCy as an NLP library and as a software package. We will see what spaCy offers in detail throughout the book.

Tips for the reader

This book is a practical guide. In order to get the most out of the book, I recommend readers replicate the code in their own Python shell. Without following and performing the code, it is not possible to get a proper understanding of NLP concepts and spaCy methods, which is why we have arranged the upcoming chapters in the following way:

  • Explanation of the language/ML concept
  • Application code with spaCy
  • Evaluation of the outcome
  • Challenges of the methodology
  • Pro tips and tricks to overcome the challenges

Installing spaCy

Let's get started by installing and setting up spaCy. spaCy is compatible with 64-bit Python 2.7 and 3.5+, and can run on Unix/Linux, macOS/OS X, and Windows. CPython is a reference implementation of Python in C. If you already have Python running on your system, most probably your CPython modules are fine too – hence you don't need to worry about this detail. The newest spaCy releases are always downloadable via pip (https://pypi.org/) and conda (https://conda.io/en/latest/). pip and conda are two of the most popular distribution packages.  

pip is the most painless choice as it installs all the dependencies, so let's start with it.

Installing spaCy with pip

You can install spaCy with the following command:

$ pip install spacy

If you have more than one Python version installed in your system (such as Python 2.8, Python 3.5, Python 3.8, and so on), then select the pip associated with Python you want to use. For instance, if you want to use spaCy with Python 3.5, you can do the following:

$ pip3.5 install spacy

If you already have spaCy installed on your system, you may want to upgrade to the latest version of spaCy. We're using spaCy 3.1 in this book; you can check which version you have with the following command:

$ python –m spacy info

This is how a version info output looks like. This has been generated with the help of my Ubuntu machine:

Figure 1.7 – An example spaCy version output

Figure 1.7 – An example spaCy version output

Suppose you want to upgrade your spaCy version. You can upgrade your spaCy version to the latest available version with the following command:

$ pip install –U spacy

Installing spaCy with conda

conda support is provided by the conda community. The command for installing spaCy with conda is as follows:

$ conda install -c conda-forge spacy 

Installing spaCy on macOS/OS X

macOS and OS X already ship with Python. You only need to install a recent version of the Xcode IDE. After installing Xcode, please run the following:

$ xcode-select –install

This installs the command-line development tools. Then you will be able to follow the preceding pip commands.

Installing spaCy on Windows

If you have a Windows system, you need to install a version of Visual C++ Build Tools or Visual Studio Express that matches your Python distribution. Here are the official distributions and their matching versions, taken from the spaCy installation guide (https://spacy.io/usage#source-windows):

Figure 1.8 – Visual Studio and Python distribution compatibility table

Figure 1.8 – Visual Studio and Python distribution compatibility table

If you didn't encounter any problems so far, then that means spaCy is installed and running on your system. You should be able to import spaCy into your Python shell:

 import spacy

Now you successfully installed spaCy – congrats and welcome to the spaCy universe! If you have installation problems please continue to the next section, otherwise you can move on to language model installation.

Troubleshooting while installing spaCy

There might be cases where you get issues popping up during the installation process. The good news is, we're using a very popular library so most probably other developers have already encountered the same issues. Most of the issues are listed on Stack Overflow (https://stackoverflow.com/questions/tagged/spacy) and the spaCy GitHub Issues section (https://github.com/explosion/spaCy/issues) already. However, in this section, we'll go over the most common issues and their solutions.

Some of the most common issues are as follows:

  • The Python distribution is incompatible: In this case please upgrade your Python version accordingly and then do a fresh installation.
  • The upgrade broke spaCy: Most probably there are some leftover packages in your installation directories. The best solution is to first remove the spaCy package completely by doing the following:
     pip uninstall spacy

    Then do a fresh installation by following the installation instructions mentioned.

  • You're unable to install spaCy on a Mac: On a Mac, please make sure that you don't skip the following to make sure you correctly installed the Mac command-line tools and enabled pip:
    $ xcode-select –install

In general, if you have the correct Python dependencies, the installation process will go smoothly.

We're all set up and ready for our first usage of spaCy, so let's go ahead and start using spaCy's language models.

Installing spaCy's statistical models

The spaCy installation doesn't come with the statistical language models needed for the spaCy pipeline tasks. spaCy language models contain knowledge about a specific language collected from a set of resources. Language models let us perform a variety of NLP tasks, including POS tagging and named-entity recognition (NER).

Different languages have different models and are language specific. There are also different models available for the same language. We'll see the differences between those models in detail in the Pro tip at the end of this section, but basically the training data is different. The underlying statistical algorithm is the same. Some of the currently supported languages are as follows:

Figure 1.9 – spaCy models overview

Figure 1.9 – spaCy models overview

The number of supported languages grows rapidly. You can follow the list of supported languages on the spaCy Models and Languages page (https://spacy.io/usage/models#languages).

Several pretrained models are available for different languages. For English, the following models are available for download: en_core_web_sm, en_core_web_md, and en_core_web_lg. These models use the following naming convention:

  • Language: Indicates the language code: en for English, de for German, and so on.
  • Type: Indicates the model capability. For instance, core means a general-purpose model for the vocabulary, syntax, entities, and vectors.
  • Genre: The type of text the model recognizes. The genre can be web (Wikipedia), news (news, media) Twitter, and so on.
  • Size: Indicates the model size: lg for large, md for medium, and sm for small.

Here is what a typical language model looks like:

Figure 1.10 – The small-sized spaCy English web model

Figure 1.10 – The small-sized spaCy English web model

Large models can require a lot of disk space, for example en_core_web_lg takes up 746 MB, while en_core_web_md needs 48MB and en_core_web_sm takes only 11MB. Medium-sized models work well for many development purposes, so we'll use the English md model throughout the book.

Pro tip

It is a good practice to match model genre to your text type. We recommend picking the genre as close as possible to your text. For example, the vocabulary in the social media genre will be very different from that in the Wikipedia genre. You can pick the web genre if you have social media posts, newspaper articles, financial news – that is, more language from daily life. The Wikipedia genre is suitable for rather formal articles, long documents, and technical documents. In case you are not sure which genre is the most suitable, you can download several models and test some example sentences from your own corpus and see how each model performs.

Now that we're well-informed about how to choose a model, let's download our first model.

Installing language models

Since v1.7.0, spaCy offers a great benefit: installing the models as Python packages. You can install spaCy models just like any other Python module and make them a part of your Python application. They're properly versioned, so they can go into your requirements.txt file as a dependency. You can install the models from a download URL or a local director manually, or via pip. You can put the model data anywhere on your local filesystem.

You can download a model via spaCy's download command. download looks for the most compatible model for your spaCy version, and then downloads and installs it. This way you don't need to bother about any potential mismatch between the model and your spaCy version. This is the easiest way to install a model:

$ python -m spacy download en_core_web_md

The preceding command selects and downloads the most compatible version of this specific model for your local spaCy version.

To download the exact model version, the following is what needs to be done (though you often don't need it):

$ python -m spacy download en_core_web_lg-2.0.0 --direct

The download command deploys pip behind the scenes. When you make a download, pip installs the package and places it in your site-packages directory just as any other installed Python package.

After the download, we can load the packages via spaCy's load () method.

This is what we did so far:

$ pip install spacy
$ python -m spacy download en_core_web_md
 import spacy
 nlp = spacy.load('en_core_web_md')
 doc = nlp('I have a ginger cat.')

We can also download models via pip:

  1. First, we need the link to the model we want to download.
  2. We navigate to the model releases (https://github.com/explosion/spacy-models/releases), find the model, and copy the archive file link.
  3. Then, we do a pip install with the model link.

Here is an example command for downloading with a custom URL:

$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz

You can install a local file as follows:

$ pip install /Users/yourself/en_core_web_lg-2.0.0.tar.gz

This installs the model into your site-packages directory. Then we run spacy.load() to load the model via its package name, create a shortcut link to give it a custom name (usually a shorter name), or import it as a module.

Importing the language model as a module is also possible:

 import en_core_web_md
 nlp = en_core_web_md.load()
 doc = nlp('I have a ginger cat.')

Pro tip

In professional software development, we usually download models as part of an automated pipeline. In this case, it's not feasible to use spaCy's download command; rather, we use pip with the model URL. You can add the model into your requirements.txt file as a package as well.

How you like to load your models is your choice and also depends on the project requirements you're working on.

At this point, we're ready to explore the spaCy world. Let's now learn about spaCy's powerful visualization tool, displaCy.

Visualization with displaCy

Visualization is an important tool that should be in every data scientist's toolbox. Visualization is the easiest way to explain some concepts to your colleagues, your boss, and any technical or non-technical audience. Visualization of language data is specifically useful and allows you to identify patterns in your data at a glance.

There are many Python libraries and plugins such as Matplotlib, seaborn, TensorBoard, and so on. Being an industrial library, spaCy comes with its own visualizer – displaCy. In this subsection, you'll learn how to spin up a displaCy server on your machine, in a Jupyter notebook, and in a web application. You'll also learn how to export the graphics you created as an image file, customize your visualizations, and make manual annotations without creating a Doc object. We'll start by exploring the easiest way – using displaCy's interactive demo.

Getting started with displaCy

Go ahead and navigate to https://explosion.ai/demos/displacy to use the interactive demo. Enter your text in the Text to parse box and then click the search icon on the right to generate the visualization. The result might look like the following:

Figure 1.11 – displaCy's online demo

Figure 1.11 – displaCy's online demo

The visualizer performs two syntactic parses, POS tagging, and a dependency parse, on the submitted text to visualize the sentence's syntactic structure. Don't worry about how POS tagging and dependency parsing work, as we'll explore them in the upcoming chapters. For now, just think of the result as a sentence structure.

You'll notice two tick boxes, Merge Punctuation and Merge Phrases. Merging punctuation merges the punctuation tokens into the previous token and serves a more compact visualization (it works like a charm on long documents).

The second option, Merge Phrases, again gives more compact dependency trees. This option merges adjectives and nouns into one phrase; if you don't merge, then adjectives and nouns will be displayed individually. This feature is useful for visualizing long sentences with many noun phrases. Let's see the difference with an example sentence: They were beautiful and healthy kids with strong appetites. It contains two noun phrases, beautiful and healthy kids and strong appetite. If we merge them, the result is as follows:

Figure 1.12 – An example parse with noun phrases merged

Figure 1.12 – An example parse with noun phrases merged

Without merging, every adjective and noun are shown individually:

Figure 1.13 – A parse of the same sentence, unmerged

Figure 1.13 – A parse of the same sentence, unmerged

The second parse is a bit too cumbersome and difficult to read. If you work on a text with long sentences such as law articles or Wikipedia entries, we definitely recommend merging.

You can choose a statistical model from the Model box on the right for the currently supported languages. This option allows you to play around with the language models without having to download and install them on your local machine.

Entity visualizer

displaCy's entity visualizer highlights the named entities in your text. The online demo lives at https://explosion.ai/demos/displacy-ent/. We didn't go through named entities yet, but you can think of them as proper nouns for important entities such as people's names, company names, dates, city and country names, and so on. Extracting entities will be covered in Chapter 3, Linguistic Features, and Chapter 4, Rule-Based Matching, in detail.

The online demo works similar to the syntactic parser demo. Enter your text into the textbox and hit the search button. Here is an example:

Figure 1.14 – An example entity visualization

Figure 1.14 – An example entity visualization

The right side contains tick boxes for entity types. You can tick the boxes that match your text type such as, for instance, MONEY and QUANTITY for a financial text. Again, just like in the syntactic parser demo, you can choose from the available models.

Visualizing within Python

With the introduction of the latest version of spaCy, the displaCy visualizers are integrated into the core library. This means that you can start using displaCy immediately after installing spaCy on your machine! Let's go through some examples.

The following code segment is the easiest way to spin up displaCy on your local machine:

 import spacy
 from spacy import displacy
 nlp = spacy.load('en_core_web_md')
 doc= nlp('I own a ginger cat.')
 displacy.serve(doc, style='dep')

As you can see from the preceding snippet, the following is what we did:

  1. We import spaCy.
  2. Following that, we import displaCy from the core library.
  3. We load the English model that we downloaded in the Installing spaCy's statistical models section.
  4. Once it is loaded, we create a Doc object to pass to displaCy.
  5. We then started the displaCy web server via calling serve().
  6. We also passed dep to the style parameter to see the dependency parsing result.

After firing up this code, you should see a response from displaCy as follows:

Figure 1.15 – Firing up displaCy locally

Figure 1.15 – Firing up displaCy locally

The response is added along with a link, http://0.0.0.0:5000, this is the local address where displaCy renders your graphics. Please click the link and navigate to the web page. You should see the following:

Figure 1.16 – View the result visualization in your browser

Figure 1.16 – View the result visualization in your browser

This means that displaCy generated a dependency parse result visualization and rendered it on your localhost. After you're finished with displaying the visual and you want to shut down the server, you can press Ctrl +C to shut down the displaCy server and go back to the Python shell:

Figure 1.17 – Shutting down the displaCy server

Figure 1.17 – Shutting down the displaCy server

After shutting down, you won't be able to visualize more examples, but you'll continue seeing the results you already generated.

If you wish to use another port or you get an error because the port 5000 is already in use, you can use the port parameter of displaCy with another port number. Replacing the last line of the preceding code block with the following line will suffice:

displacy.serve(doc, style='dep', port= '5001')

Here, we provide the port number 5001 explicitly. In this case, displaCy will render the graphics on http://0.0.0.0:5001.

Generating an entity recognizer is done similarly. We pass ent to the style parameter instead of dep:

 import spacy
 from spacy import displacy
 nlp = spacy.load('en_core_web_md')
 doc= nlp('Bill Gates is the CEO of Microsoft.')
 displacy.serve(doc, style='ent')

The result should look like the following:

Figure 1.18 – The entity visualization is displayed on your browser

Figure 1.18 – The entity visualization is displayed on your browser

Let's move on to other platforms we can use for displaying the results.

Using displaCy in Jupyter notebooks

Jupyter notebook is an important part of daily data science work. Fortunately, displaCy can spot whether you're currently coding in a Jupyter notebook environment and returns markup that can be directly displayed in a cell.

If you don't have Jupyter notebook installed on your system but wish to use it, you can follow the instructions at https://test-jupyter.readthedocs.io/en/latest/install.html.

This time we'll call render() instead of serve(). The rest of the code is the same. You can type/paste the following code into your Jupyter notebook:

import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_md')
doc= nlp('Bill Gates is the CEO of Microsoft.')
displacy.render(doc, style='dep')

The result should look like the following:

Figure 1.19 – displaCy rendering results in a Jupyter notebook

Figure 1.19 – displaCy rendering results in a Jupyter notebook

Exporting displaCy graphics as an image file

Often, we need to export the graphics that we generated with displaCy as image files to place them into presentations, articles, or papers. We can call displaCy in this case as well:

 import spacy
 from spacy import displacy
 from pathlib import Path
 nlp = spacy.load('en_core_web_md')
 doc = nlp('I'm a butterfly.')
 svg = displacy.render(doc, style='dep', jupyter=False)
 filename = 'butterfly.svg'
  output_path = Path ('/images/' + file_name)
 output_path.open('w', encoding='utf-8').write(svg)

We import spaCy and displaCy. We load the English language model, then create a Doc object as usual. Then we call displacy.render() and capture the output to the svg variable. The rest is writing the svg variable to a file called butterfly.svg.

We have reached the end of the visualization chapter here. We created good-looking visuals and learned the details of creating visuals with displaCy. If you wish to find out how to use different background images, background colors, and fonts, you can visit the displaCy documentation at http://spacy.io/usage/visualizers.

Often, we need to create visuals with different colors and styling, and the displaCy documentation contains detailed information about styling. The documentation also includes how to embed displaCy into your web applications. spaCy is well documented as a project and the documentation contains everything we need!

Summary

This chapter gave you an introduction to NLP with Python and spaCy. You now have a brief idea about why to use Python for language processing and the reasons to prefer spaCy for creating your NLP applications. We also got started on our spaCy journey by installing spaCy and downloading language models. This chapter also introduced us to the visualization tool – displaCy.

In the next chapter, we will continue our exciting spaCy journey with spaCy core operations such as tokenization and lemmatization. It'll be our first encounter with spaCy features in detail. Let's go ahead and explore more together!

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Gain an overview of what spaCy offers for natural language processing
  • Learn details of spaCy's features and how to use them effectively
  • Work through practical recipes using spaCy

Description

spaCy is an industrial-grade, efficient NLP Python library. It offers various pre-trained models and ready-to-use features. Mastering spaCy provides you with end-to-end coverage of spaCy's features and real-world applications. You'll begin by installing spaCy and downloading models, before progressing to spaCy's features and prototyping real-world NLP apps. Next, you'll get familiar with visualizing with spaCy's popular visualizer displaCy. The book also equips you with practical illustrations for pattern matching and helps you advance into the world of semantics with word vectors. Statistical information extraction methods are also explained in detail. Later, you'll cover an interactive business case study that shows you how to combine all spaCy features for creating a real-world NLP pipeline. You'll implement ML models such as sentiment analysis, intent recognition, and context resolution. The book further focuses on classification with popular frameworks such as TensorFlow's Keras API together with spaCy. You'll cover popular topics, including intent classification and sentiment analysis, and use them on popular datasets and interpret the classification results. By the end of this book, you'll be able to confidently use spaCy, including its linguistic features, word vectors, and classifiers, to create your own NLP apps.

Who is this book for?

This book is for data scientists and machine learners who want to excel in NLP as well as NLP developers who want to master spaCy and build applications with it. Language and speech professionals who want to get hands-on with Python and spaCy and software developers who want to quickly prototype applications with spaCy will also find this book helpful. Beginner-level knowledge of the Python programming language is required to get the most out of this book. A beginner-level understanding of linguistics such as parsing, POS tags, and semantic similarity will also be useful.

What you will learn

  • Install spaCy, get started easily, and write your first Python script
  • Understand core linguistic operations of spaCy
  • Discover how to combine rule-based components with spaCy statistical models
  • Become well-versed with named entity and keyword extraction
  • Build your own ML pipelines using spaCy
  • Apply all the knowledge you ve gained to design a chatbot using spaCy

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jul 09, 2021
Length: 356 pages
Edition : 1st
Language : English
ISBN-13 : 9781800563353
Category :
Languages :
Tools :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Jul 09, 2021
Length: 356 pages
Edition : 1st
Language : English
ISBN-13 : 9781800563353
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 158.97
Python Natural Language Processing Cookbook
$54.99
Mastering Transformers
$54.99
Mastering spaCy
$48.99
Total $ 158.97 Stars icon
Banner background image

Table of Contents

14 Chapters
Section 1: Getting Started with spaCy Chevron down icon Chevron up icon
Chapter 1: Getting Started with spaCy Chevron down icon Chevron up icon
Chapter 2: Core Operations with spaCy Chevron down icon Chevron up icon
Section 2: spaCy Features Chevron down icon Chevron up icon
Chapter 3: Linguistic Features Chevron down icon Chevron up icon
Chapter 4: Rule-Based Matching Chevron down icon Chevron up icon
Chapter 5: Working with Word Vectors and Semantic Similarity Chevron down icon Chevron up icon
Chapter 6: Putting Everything Together: Semantic Parsing with spaCy Chevron down icon Chevron up icon
Section 3: Machine Learning with spaCy Chevron down icon Chevron up icon
Chapter 7: Customizing spaCy Models Chevron down icon Chevron up icon
Chapter 8: Text Classification with spaCy Chevron down icon Chevron up icon
Chapter 9: spaCy and Transformers Chevron down icon Chevron up icon
Chapter 10: Putting Everything Together: Designing Your Chatbot with spaCy Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3
(15 Ratings)
5 star 66.7%
4 star 13.3%
3 star 13.3%
2 star 0%
1 star 6.7%
Filter icon Filter
Top Reviews

Filter reviews by




brian chan Jul 23, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Clear and concise presentation of NLP concepts and Spacy for beginner. Thoroughly enjoyed the working codes I can follow along.
Feefo Verified review Feefo
Kai Jul 20, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
As someone who is only brushing the surface of what NLP can do, this is a great book to get your proverbial toes wet.Bear in mind that you should have some working knowledge of how to code before you get into it. But once you have your foundations set, Duygu's book is a breeze to follow along and get a working model up and running in no time!I can't wait to homebrew with this newly found knowledge.
Amazon Verified review Amazon
Amazon Customer Oct 26, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I came across Mastering spaCy when I was looking for references for my AI project which involves the use of transformers. I have extensive experience in many aspects of data science, however, NLP has always been an area I want to improve on. As I am reading this book, I can’t help but realize it has gradually become my favorite reference whenever I come across an AI project which more or less involves the implementation of NLP. I like this book for that it is not merely a textbook to the spaCy python library, but also an elaboration to the most commonly seen terms in NLP, such as POS tagging, tokenization, lemmatization, etc, with examples. Thus, reading this book allows me to learn about spaCy and linguistics simultaneously.To gain the most out of this book, I suggest that you prepare a vocabulary list and take notes on the terms appearing in this book that are intuitive to you. You should really take your time to taste the first 6 chapters of Mastering spaCy as it walks you through the general process of completing an NLP project, the concepts (NER and information extraction, etc) you might come across, and the functions in spaCy that apply these concepts.Additionally, I want to express my appreciation to the author, Miss Duygu Altinok, that I particularly enjoy the hands-on practice in chapter 8 and 9 as they are relevant to the project I am currently working on. I am able to borrow part of the codes from these two chapters to improve my own transformer. I strongly recommend this book to everyone who struggles with NLP. It influences me, and I believe it will influence you too!
Amazon Verified review Amazon
Zhenya Oct 25, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Mastering spaCy is a great book for those who want to work with spaCy. It is very thorough in its explanations of spaCy's features, including rule-based and machine learning tasks, and it is suitable for a beginner. Experts will likewise this book very useful as a reference in their daily work. It describes all the important features of spaCy, such as part-of-speech tagging, syntactic and semantic parsing, named entity recognition, word vectors, building and updating machine learning models, using deep learning and transformers, and a complete chatbot put together using spaCy. Overall, an excellent book for the NLP practitioner.
Amazon Verified review Amazon
Emanuel Di Nardo Jul 13, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
To be honest, I'm liking this book a lot.It starts with a little introduction to the rise of NLP and of course SpaCy installation.It explains basic NLP concepts like tokenization, POS tagging, dependency parsing, Named Entity Recognition. Everything with a lot of python examples.It covers concepts like word embeddings, and more in general word numerical representation algorithms.The book covers also data annotation with tools like Brat and of course text classification with SpaCy and Keras.It has a section on Bert and Chatbots as well.Everything with a lot of code examples.I found this book very readable and complete.I recommend it.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.