
Natural Language Processing and How ML Models Understand Text
07/29/22 • 58 min
How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.
Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
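As a rough illustration of turning documents into rows and columns, the sketch below builds a bag-of-words vocabulary and then produces both count vectors and binary vectors using only the standard library. The helper names (`tokenize`, `build_vocabulary`, `count_vectorize`, `binary_vectorize`) and the two sample sentences are made up for this example and are not from the episode. In practice you would more likely reach for scikit-learn's `CountVectorizer`, which is listed in the show links.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical, very naive tokenizer: lowercase, split on whitespace,
    # and strip a few punctuation marks from the ends of each token.
    tokens = (word.strip(".,!?").lower() for word in text.split())
    return [token for token in tokens if token]


def build_vocabulary(docs):
    # Each unique token becomes one column; sorting keeps the order stable.
    return sorted({token for doc in docs for token in tokenize(doc)})


def count_vectorize(docs, vocab):
    # One row per document, one column per vocabulary term,
    # each cell holding how often the term occurs in that document.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows


def binary_vectorize(docs, vocab):
    # Same layout as count vectors, but each cell only records presence (1)
    # or absence (0) of the term.
    return [[1 if count > 0 else 0 for count in row]
            for row in count_vectorize(docs, vocab)]


docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab = build_vocabulary(docs)
counts = count_vectorize(docs, vocab)
binary = binary_vectorize(docs, vocab)

print(vocab)   # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(counts)  # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
print(binary)  # [[1, 0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 1]]
```

The only difference between the two outputs is the "the" column: count vectorization records that it appears twice, while binary vectorization only records that it appears.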
We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.
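Word embedding models represent each word as a dense vector, so related words can be compared by the angle between their vectors. The sketch below shows only that comparison step, using cosine similarity. The three-dimensional vectors in `toy_vectors` are invented by hand for illustration; they are not output from word2vec or Gensim, and real embeddings typically have far more dimensions. The `most_similar` helper is likewise a hypothetical stand-in for the similarity lookups an embedding library would provide.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (an assumption for this example, not trained values).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def most_similar(word, vectors):
    # Return the other word whose vector points in the most similar direction.
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )


print(most_similar("king", toy_vectors))  # queen
```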
Course Spotlight: Learn Text Classification With Python and Keras
In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.
Topics:
- 00:00:00 – Introduction
- 00:02:47 – Exploring the topic
- 00:06:00 – Perceived sentience of LaMDA
- 00:10:24 – How do we get started?
- 00:11:16 – What are classification and sentiment analysis?
- 00:13:03 – Transforming text into rows and columns
- 00:14:47 – Sponsor: Snyk
- 00:15:27 – Bag-of-words approach
- 00:19:12 – Stemming and lemmatization
- 00:22:05 – Capturing N-grams
- 00:25:34 – Count vectorization
- 00:27:14 – Stop words
- 00:28:46 – Term Frequency / Inverse Document Frequency (TF-IDF) vectorization
- 00:32:28 – Potential projects for bag-of-words techniques
- 00:34:07 – Video Course Spotlight
- 00:35:20 – WordNet and NLTK package
- 00:37:27 – Word embeddings and word2vec
- 00:45:30 – Previous training and too many dimensions
- 00:50:07 – How to use word2vec and Gensim?
- 00:51:26 – What types of projects for word2vec and Gensim?
- 00:54:41 – Getting into GPT and BERT in another episode
- 00:56:11 – How to follow Jodie’s work?
- 00:57:36 – Thanks and goodbye
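To make the TF-IDF topic above concrete, here is a minimal sketch using the textbook definition: term frequency within a document multiplied by the log of the inverse document frequency. The `tfidf` function and the three sample documents are assumptions made for this example. scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tfidf(docs):
    # Naive whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights.append({
            # tf = share of the document's tokens; idf = log(N / df).
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tfidf(docs)

# "the" appears in every document, so its weight drops to zero, while
# "sat" (unique to the first document) outweighs "cat" (shared by two).
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

A term that occurs everywhere carries no information for telling documents apart, which is also the intuition behind removing stop words.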
Show Links:
- Why Google’s “sentient” AI LaMDA is nothing like a person.
- On NYT Magazine on AI: Resist the Urge to be Impressed | Emily M. Bender | Medium
- ELIZA - Wikipedia
- eliza.py - Python 2 version by Daniel Connelly
- dabraude/Pyliza: Python3 Implementation of Eliza
- magneticpoetry.com
- Natural Language Processing With Python’s NLTK Package – Real Python
- Practical Text Classification With Python and Keras – Real Python
- Sentiment Analysis: First Steps With Python’s NLTK Library – Real Python
- NLTK: Natural Language Toolkit
- spaCy · Industrial-strength Natural Language Processing in Python
- Natural Language Processing With spaCy in Python - Real Python
- Stemming - Wikipedia
- Lemmatization - Wikipedia
- Binary/Count Vectorization: sklearn.feature_extraction.text.CountVectorizer — scikit-learn
- TFIDF: sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn
- Porter Stemmer: nltk.stem.porter module — NLTK
- Snowbal...
Previous Episode

Creating Documentation With MkDocs & When to Use a Python dict
How do you start building your project documentation? What if you had a tool that could do the heavy lifting and automatically write large portions directly from your code? This week on the show, Christopher Trudeau is here, bringing another batch of PyCoder’s Weekly articles and projects.
We talk about a Real Python step-by-step project from Martin Breuss about MkDocs. The project walks you through generating nice-looking and modern documentation from Markdown files and your existing code’s docstrings. The final step is to deploy your freshly generated documentation to a GitHub repository.
Christopher talks about a pair of articles arguing for and against using Python dictionaries. The first article, “Just Use Dictionaries,” pushes to keep things simple, while the second article, “Don’t Let Dicts Spoil Your Code,” contends that complex projects require something more specific.
We cover several other articles and projects from the Python community, including discussing the recent beta release of Python 3.11, 2FA for PyPI, procedural music composition with arvo, building a tic-tac-toe game with Python and Tkinter, common issues encountered while coding in Python, a type-safe library to generate SVG files, and a lightweight static analysis tool for your projects.
Course Spotlight: Dictionaries and Arrays: Selecting the Ideal Data Structure
In this course, you’ll learn about two of Python’s data structures: dictionaries and arrays. You’ll look at multiple types and classes for both of these and learn which implementations are best for your specific use cases.
Topics:
- 00:00:00 – Introduction
- 00:02:39 – Python 3.11 Release May Be Delayed
- 00:03:39 – The cursed release of Python 3.11.0b4 is now available
- 00:05:01 – PyPI 2FA Security Key Giveaway
- 00:08:01 – Build Your Python Project Documentation With MkDocs
- 00:14:12 – Don’t Let Dicts Spoil Your Code
- 00:16:22 – Just Use Dictionaries
- 00:20:12 – Sponsor: Snyk.io
- 00:20:51 – Procedural Music Composition With arvo
- 00:29:10 – Build a Tic-Tac-Toe Game With Python and Tkinter
- 00:33:59 – Video Course Spotlight
- 00:35:35 – Most Common Issue You Have Coding With Python?
- 00:45:00 – svg.py: Type-Safe Library to Generate SVG Files
- 00:48:27 – semgrep: Lightweight Static Analysis for Many Languages
- 00:53:46 – Thanks and goodbye
News:
- Python 3.11 Release May Be Delayed
- The cursed release of Python 3.11.0b4 is now available - Python.org.
- “We’ve begun rolling out a 2FA requirement”: PyPI on Twitter
- PyPI 2FA Security Key Giveaway
Topic Links:
- Build Your Python Project Documentation With MkDocs – In this tutorial, you’ll learn how to build professional documentation for a Python package using MkDocs and mkdocstrings. These tools allow you to generate nice-looking and modern documentation from Markdown files and, more importantly, from your code’s docstrings.
- Don’t Let Dicts Spoil Your Code – The dict is the go-to data structure for Python programmers, but its loose relationship to the data can be problematic in large data streams. Learn more about why and when you might choose a different data structure.
- Just Use Dictionaries – Using simple data structures is an important part of keeping it simple, and Python is all about simplicity. Less code means fewer problems. Just use dictionaries. You probably don’t need classes.
- Procedural Music Composition With arvo – By using the music21 and arvo libraries, you can create musical scores programmatically. This article runs you through which libraries you need and how you can compose your own music.
- Build a Tic-Tac-Toe Game With Python and Tkinter – In this step-by-step project, you’ll learn how to create a tic-tac-toe game using Python and the Tkinter GUI framework. Tkinter is cross-platform and is available in the Python standard library. Creating a game in Python is a great and fun way to learn something new and e...
Next Episode

Inspiring Young People to Learn Python With Mission Encodeable
Is there someone in your life you’d like to inspire to learn Python? Mission Encodeable is a website designed to teach people to code, built by two high-school students. This week on the show, Anna and Harry Wake talk about creating their site and motivating people to start coding.
We discuss why they decided to build the site. Anna and Harry initially felt that the site would be for other students but soon realized it could be helpful for anyone interested in starting to code in Python. We cover the project-based approach and how they implemented the interactive browser-based tool replit.com.
We talk about learning Python in the classroom and how they found additional books and tutorials to supplement their coding education. Anna and Harry also created a resource hub to help teachers take advantage of the site.
Course Spotlight: Rock, Paper, Scissors With Python: A Command Line Game
In this course, you’ll learn to program rock paper scissors in Python from scratch. You’ll learn how to take in user input, make the computer choose a random action, determine a winner, and split your code into functions.
Topics:
- 00:00:00 – Introduction
- 00:02:17 – Personal backgrounds
- 00:02:51 – What’s the goal for the site?
- 00:03:54 – How did you come up with the idea?
- 00:05:08 – Where have you shared it?
- 00:06:39 – Projects for each level
- 00:09:28 – How has the response been?
- 00:10:10 – Using replit
- 00:12:56 – Sponsor: CData Software
- 00:13:37 – Design of the site and other tools to create it
- 00:15:49 – Learning Python and classes at school
- 00:17:41 – Did remote school inspire more online exploration?
- 00:19:16 – Myths of how kids learn programming
- 00:23:32 – More about projects
- 00:27:57 – Video Course Spotlight
- 00:29:27 – What other areas of Python do you want to explore?
- 00:33:08 – Teachers using the site
- 00:37:11 – What other resources have you used to learn Python?
- 00:38:52 – What are you excited about in the world of Python?
- 00:40:01 – What do you want to learn next?
- 00:42:06 – Thanks and goodbye
Show Links:
- Mission Encodeable | Free coding tutorials for young people
- Replit - The collaborative browser based IDE
- Make Your First Python Game: Rock, Paper, Scissors! – Real Python
- Figma: the collaborative interface design tool.
- React – A JavaScript library for building user interfaces
- Coding with Minecraft - Al Sweigart
- The Recursive Book of Recursion - Al Sweigart
- Codewars - Achieve mastery through coding practice and developer mentorship
- Advent of Code 2021
- Object-Oriented Programming (OOP) in Python 3 – Real Python
- Python Arcade
- Craig’n’Dave “Unscripted” - Mission Encodeable - YouTube
- Hello World issue 19 — Hello World
- LearningDust: LearningDust 3.15 - Anna & Harry Wake
- Teaching Python Episode 93: Mission Encodeable
- Mission Encodeable (@missionencode) / Twitter