Natural Language Processing and How ML Models Understand Text

07/29/22 • 58 min

How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.

Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
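To make the rows-and-columns idea concrete, here is a minimal sketch of binary and count vectorization using only the standard library. The two-sentence corpus, the simple regex tokenizer, and the function names are assumptions made for illustration; they are not taken from the episode, and a real project would more likely reach for a library vectorizer.

```python
import re
from collections import Counter

# Hypothetical toy corpus, made up for illustration.
docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Each unique word becomes a column; each document becomes a row.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def count_vector(text):
    """Count vectorization: how often each vocabulary word occurs."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def binary_vector(text):
    """Binary vectorization: 1 if the word is present, 0 if it is absent."""
    return [1 if count else 0 for count in count_vector(text)]


count_matrix = [count_vector(doc) for doc in docs]
binary_matrix = [binary_vector(doc) for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Word order is discarded in both matrices, which is why this family of techniques is called bag-of-words.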

We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.
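As a rough illustration of why word embeddings are useful, the sketch below compares words by the cosine similarity of their vectors. It does not use word2vec or Gensim: the three-dimensional vectors are hand-picked toy values, not output from a trained model, and the helper names are hypothetical.

```python
import math

# Hand-picked toy vectors, NOT produced by a trained embedding model.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other word whose toy vector is closest to the given word."""
    others = (w for w in toy_embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(toy_embeddings[word], toy_embeddings[w]),
    )


print(most_similar("king"))
```

With trained embeddings the vectors typically have many more dimensions, but the comparison step works the same way.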

Course Spotlight: Learn Text Classification With Python and Keras

In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.

Topics:

00:00:00 – Introduction
00:02:47 – Exploring the topic
00:06:00 – Perceived sentience of LaMDA
00:10:24 – How do we get started?
00:11:16 – What are classification and sentiment analysis?
00:13:03 – Transforming text into rows and columns
00:14:47 – Sponsor: Snyk
00:15:27 – Bag-of-words approach
00:19:12 – Stemming and lemmatization
00:22:05 – Capturing N-grams
00:25:34 – Count vectorization
00:27:14 – Stop words
00:28:46 – Term Frequency / Inverse Document Frequency (TF-IDF) vectorization
00:32:28 – Potential projects for bag-of-words techniques
00:34:07 – Video Course Spotlight
00:35:20 – WordNet and NLTK package
00:37:27 – Word embeddings and word2vec
00:45:30 – Previous training and too many dimensions
00:50:07 – How to use word2vec and Gensim?
00:51:26 – What types of projects for word2vec and Gensim?
00:54:41 – Getting into GPT and BERT in another episode
00:56:11 – How to follow Jodie’s work?
00:57:36 – Thanks and goodbye

Show Links:

Previous Episode

Creating Documentation With MkDocs & When to Use a Python dict

How do you start building your project documentation? What if you had a tool that could do the heavy lifting and automatically write large portions directly from your code? This week on the show, Christopher Trudeau is here, bringing another batch of PyCoder’s Weekly articles and projects.

We talk about a Real Python step-by-step project from Martin Breuss about MkDocs. The project walks you through generating nice-looking and modern documentation from Markdown files and your existing code’s docstrings. The final step is to deploy your freshly generated documentation to a GitHub repository.

Christopher talks about a pair of articles arguing for and against using Python dictionaries. The first article, “Just Use Dictionaries,” pushes to keep things simple, while the second article, “Don’t Let Dicts Spoil Your Code,” contends that complex projects require something more specific.
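One way to picture the trade-off is the sketch below, which contrasts a plain dictionary with a dataclass. It is not taken from either article; the `Episode` class and the misspelled `minuts` key are hypothetical, chosen only to show where each approach surfaces a mistake.

```python
from dataclasses import dataclass

# A plain dict is quick to write, but it accepts any key,
# so the misspelled "minuts" goes unnoticed here.
record = {
    "title": "Natural Language Processing and How ML Models Understand Text",
    "minuts": 58,
}
has_minutes = "minutes" in record


@dataclass(frozen=True)
class Episode:
    """Hypothetical record type with a fixed set of named fields."""

    title: str
    minutes: int


episode = Episode(
    title="Natural Language Processing and How ML Models Understand Text",
    minutes=58,
)

# The same typo is rejected as soon as the object is constructed.
try:
    Episode(title="Some title", minuts=58)
    rejected = False
except TypeError:
    rejected = True

print(has_minutes, rejected)
```

The dictionary keeps the code short, while the dataclass documents the expected shape of the data and fails early when that shape is violated.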

We cover several other articles and projects from the Python community, including discussing the recent beta release of Python 3.11, 2FA for PyPI, procedural music composition with arvo, building a tic-tac-toe game with Python and Tkinter, common issues encountered while coding in Python, a type-safe library to generate SVG files, and a lightweight static analysis tool for your projects.

Course Spotlight: Dictionaries and Arrays: Selecting the Ideal Data Structure

In this course, you’ll learn about two of Python’s data structures: dictionaries and arrays. You’ll look at multiple types and classes for both of these and learn which implementations are best for your specific use cases.

Topics:

00:00:00 – Introduction
00:02:39 – Python 3.11 Release May Be Delayed
00:03:39 – The cursed release of Python 3.11.0b4 is now available
00:05:01 – PyPI 2FA Security Key Giveaway
00:08:01 – Build Your Python Project Documentation With MkDocs
00:14:12 – Don’t Let Dicts Spoil Your Code
00:16:22 – Just Use Dictionaries
00:20:12 – Sponsor: Snyk.io
00:20:51 – Procedural Music Composition With arvo
00:29:10 – Build a Tic-Tac-Toe Game With Python and Tkinter
00:33:59 – Video Course Spotlight
00:35:35 – Most Common Issue You Have Coding With Python?
00:45:00 – svg.py: Type-Safe Library to Generate SVG Files
00:48:27 – semgrep: Lightweight Static Analysis for Many Languages
00:53:46 – Thanks and goodbye

News:

Topic Links:

Build Your Python Project Documentation With MkDocs – In this tutorial, you’ll learn how to build professional documentation for a Python package using MkDocs and mkdocstrings. These tools allow you to generate nice-looking and modern documentation from Markdown files and, more importantly, from your code’s docstrings.
Don’t Let Dicts Spoil Your Code – The dict is the go-to data structure for Python programmers, but its loose relationship to the data can be problematic in large data streams. Learn more about why and when you might choose a different data structure.
Just Use Dictionaries – Using simple data structures is an important part of keeping it simple, and Python is all about simplicity. Less code means fewer problems. Just use dictionaries. You probably don’t need classes.
Procedural Music Composition With arvo – By using the music21 and arvo libraries, you can create musical scores programmatically. This article runs you through which libraries you need and how you can compose your own music.
Build a Tic-Tac-Toe Game With Python and Tkinter – In this step-by-step project, you’ll learn how to create a tic-tac-toe game using Python and the Tkinter GUI framework. Tkinter is cross-platform and is available in the Python standard library. Creating a game in Python is a great and fun way to learn something new and e...

Next Episode

Inspiring Young People to Learn Python With Mission Encodeable

Is there someone in your life you’d like to inspire to learn Python? Mission Encodeable is a website designed to teach people to code, built by two high-school students. This week on the show, Anna and Harry Wake talk about creating their site and motivating people to start coding.

We discuss why they decided to build the site. Anna and Harry initially felt that the site would be for other students but soon realized it could be helpful for anyone interested in starting to code in Python. We cover the project-based approach and how they implemented the interactive browser-based tool replit.com.

We talk about learning Python in the classroom and how they found additional books and tutorials to supplement their coding education. Anna and Harry also created a resource hub to help teachers take advantage of the site.

Course Spotlight: Rock, Paper, Scissors With Python: A Command Line Game

In this course, you’ll learn to program rock paper scissors in Python from scratch. You’ll learn how to take in user input, make the computer choose a random action, determine a winner, and split your code into functions.

Topics: