The Real Python Podcast - Natural Language Processing and How ML Models Understand Text

07/29/22 • 58 min

How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.

Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
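As a rough illustration of the binary and count vectorization approaches described here, the sketch below builds a tiny document-term matrix in plain Python (a toy example with made-up documents, not the scikit-learn implementation discussed on the show):

```python
# Toy sketch: turn text documents into rows (documents) and
# columns (vocabulary terms), as a bag-of-words model does.
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Build a sorted vocabulary across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

# Count vectorization: each cell holds how often a term appears.
count_vectors = [[doc.split().count(term) for term in vocab] for doc in docs]

# Binary vectorization: each cell records only presence or absence.
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]

print(vocab)           # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(count_vectors)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
print(binary_vectors)  # [[1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1]]
```

Real pipelines would also apply stemming or lemmatization before counting, so that "sat" and "sit" can share a column.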

We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.
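To give a feel for what word embedding models like word2vec produce, the toy sketch below compares hand-made vectors with cosine similarity. The three-dimensional "embeddings" here are invented for illustration; real word2vec vectors have hundreds of dimensions and are learned from large corpora with tools like Gensim:

```python
import math

# Hypothetical toy embeddings, NOT real word2vec output.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Angle-based similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Words with related meanings should sit closer together in the space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # near 1.0
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```

Gensim's `KeyedVectors` exposes the same idea through methods like `most_similar`, but against vectors trained on real text.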

Course Spotlight: Learn Text Classification With Python and Keras

In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.

Topics:

  • 00:00:00 – Introduction
  • 00:02:47 – Exploring the topic
  • 00:06:00 – Perceived sentience of LaMDA
  • 00:10:24 – How do we get started?
  • 00:11:16 – What are classification and sentiment analysis?
  • 00:13:03 – Transforming text into rows and columns
  • 00:14:47 – Sponsor: Snyk
  • 00:15:27 – Bag-of-words approach
  • 00:19:12 – Stemming and lemmatization
  • 00:22:05 – Capturing N-grams
  • 00:25:34 – Count vectorization
  • 00:27:14 – Stop words
  • 00:28:46 – Term Frequency / Inverse Document Frequency (TF-IDF) vectorization
  • 00:32:28 – Potential projects for bag-of-words techniques
  • 00:34:07 – Video Course Spotlight
  • 00:35:20 – WordNet and NLTK package
  • 00:37:27 – Word embeddings and word2vec
  • 00:45:30 – Previous training and too many dimensions
  • 00:50:07 – How to use word2vec and Gensim?
  • 00:51:26 – What types of projects for word2vec and Gensim?
  • 00:54:41 – Getting into GPT and BERT in another episode
  • 00:56:11 – How to follow Jodie’s work?
  • 00:57:36 – Thanks and goodbye
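The TF-IDF vectorization covered in the episode can be sketched in a few lines of plain Python (a toy example with invented documents; libraries like scikit-learn add smoothing and normalization on top of this basic idea):

```python
import math

# Toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "the", "cat"],
]
n_docs = len(docs)

def tf_idf(term, doc):
    """Term frequency scaled down by how common the term is across docs."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # document frequency
    idf = math.log(n_docs / df)
    return tf * idf

# "the" appears in every document, so its idf (and tf-idf) is zero:
print(tf_idf("the", docs[0]))  # 0.0
# "dog" appears in only one document, so it scores highly there:
print(tf_idf("dog", docs[1]))
```

This is why TF-IDF often makes an explicit stop-word list less critical: ubiquitous words are down-weighted automatically.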

Show Links:


Previous Episode

Creating Documentation With MkDocs & When to Use a Python dict

How do you start building your project documentation? What if you had a tool that could do the heavy lifting and automatically write large portions directly from your code? This week on the show, Christopher Trudeau is here, bringing another batch of PyCoder’s Weekly articles and projects.

We talk about a Real Python step-by-step project from Martin Breuss about MkDocs. The project walks you through generating nice-looking and modern documentation from Markdown files and your existing code’s docstrings. The final step is to deploy your freshly generated documentation to a GitHub repository.

Christopher talks about a pair of articles arguing for and against using Python dictionaries. The first article, “Just Use Dictionaries,” pushes to keep things simple, while the second article, “Don’t Let Dicts Spoil Your Code,” contends that complex projects require something more specific.
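The trade-off the two articles debate can be shown in a few lines. The sketch below is a hypothetical example (names and fields invented here), contrasting a plain dict with a dataclass that makes its shape explicit:

```python
from dataclasses import dataclass

# A plain dict accepts any key, so a typo silently creates a new one.
user = {"name": "Ada", "email": "ada@example.com"}
user["emial"] = "oops"  # no error raised; the bug hides until later

# A dataclass makes the expected fields explicit and requires them
# at construction time.
@dataclass
class User:
    name: str
    email: str

ada = User(name="Ada", email="ada@example.com")
try:
    User(name="Ada")  # missing the required email field
except TypeError as exc:
    print(f"caught: {exc}")
```

The dict is less code for quick scripts; the dataclass pays off once the same structure travels through a larger codebase.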

We cover several other articles and projects from the Python community, including discussing the recent beta release of Python 3.11, 2FA for PyPI, procedural music composition with arvo, building a tic-tac-toe game with Python and Tkinter, common issues encountered while coding in Python, a type-safe library to generate SVG files, and a lightweight static analysis tool for your projects.

Course Spotlight: Dictionaries and Arrays: Selecting the Ideal Data Structure

In this course, you’ll learn about two of Python’s data structures: dictionaries and arrays. You’ll look at multiple types and classes for both of these and learn which implementations are best for your specific use cases.

Topics:

  • 00:00:00 – Introduction
  • 00:02:39 – Python 3.11 Release May Be Delayed
  • 00:03:39 – The cursed release of Python 3.11.0b4 is now available
  • 00:05:01 – PyPI 2FA Security Key Giveaway
  • 00:08:01 – Build Your Python Project Documentation With MkDocs
  • 00:14:12 – Don’t Let Dicts Spoil Your Code
  • 00:16:22 – Just Use Dictionaries
  • 00:20:12 – Sponsor: Snyk.io
  • 00:20:51 – Procedural Music Composition With arvo
  • 00:29:10 – Build a Tic-Tac-Toe Game With Python and Tkinter
  • 00:33:59 – Video Course Spotlight
  • 00:35:35 – Most Common Issue You Have Coding With Python?
  • 00:45:00 – svg.py: Type-Safe Library to Generate SVG Files
  • 00:48:27 – semgrep: Lightweight Static Analysis for Many Languages
  • 00:53:46 – Thanks and goodbye

News:

Topic Links:

  • Build Your Python Project Documentation With MkDocs – In this tutorial, you’ll learn how to build professional documentation for a Python package using MkDocs and mkdocstrings. These tools allow you to generate nice-looking and modern documentation from Markdown files and, more importantly, from your code’s docstrings.
  • Don’t Let Dicts Spoil Your Code – The dict is the go-to data structure for Python programmers, but its loose relationship to the data can be problematic in large data streams. Learn more about why and when you might choose a different data structure.
  • Just Use Dictionaries – Using simple data structures is an important part of keeping it simple, and Python is all about simplicity. Less code means fewer problems. Just use dictionaries. You probably don’t need classes.
  • Procedural Music Composition With arvo – By using the music21 and arvo libraries, you can create musical scores programmatically. This article runs you through which libraries you need and how you can compose your own music.
  • Build a Tic-Tac-Toe Game With Python and Tkinter – In this step-by-step project, you’ll learn how to create a tic-tac-toe game using Python and the Tkinter GUI framework. Tkinter is cross-platform and is available in the Python standard library. Creating a game in Python is a great and fun way to learn something new and e...

Next Episode

Inspiring Young People to Learn Python With Mission Encodeable

Is there someone in your life you’d like to inspire to learn Python? Mission Encodeable is a website designed to teach people to code, built by two high-school students. This week on the show, Anna and Harry Wake talk about creating their site and motivating people to start coding.

We discuss why they decided to build the site. Anna and Harry initially felt that the site would be for other students but soon realized it could be helpful for anyone interested in starting to code in Python. We cover the project-based approach and how they implemented the interactive browser-based tool replit.com.

We talk about learning Python in the classroom and how they found additional books and tutorials to supplement their coding education. Anna and Harry also created a resource hub to help teachers take advantage of the site.

Course Spotlight: Rock, Paper, Scissors With Python: A Command Line Game

In this course, you’ll learn to program rock paper scissors in Python from scratch. You’ll learn how to take in user input, make the computer choose a random action, determine a winner, and split your code into functions.
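The steps the course description lists (take user input, pick a random computer action, determine a winner) can be sketched roughly like this; it's a minimal outline, not the course's actual code:

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def determine_winner(player, computer):
    """Return 'player', 'computer', or 'tie' for one round."""
    if player == computer:
        return "tie"
    return "player" if BEATS[player] == computer else "computer"

def play(player_choice=None):
    """One round: read the player's move, randomize the computer's."""
    player = player_choice or input(f"Choose one of {ACTIONS}: ").strip().lower()
    computer = random.choice(ACTIONS)
    print(f"Computer chose {computer}.")
    return determine_winner(player, computer)

print(play("rock"))
```

Splitting the winner logic into its own function, as the course suggests, keeps it easy to test without faking user input.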

Topics:

  • 00:00:00 – Introduction
  • 00:02:17 – Personal backgrounds
  • 00:02:51 – What’s the goal for the site?
  • 00:03:54 – How did you come up with the idea?
  • 00:05:08 – Where have you shared it?
  • 00:06:39 – Projects for each level
  • 00:09:28 – How has the response been?
  • 00:10:10 – Using replit
  • 00:12:56 – Sponsor: CData Software
  • 00:13:37 – Design of the site and other tools to create it
  • 00:15:49 – Learning Python and classes at school
  • 00:17:41 – Did remote school inspire more online exploration?
  • 00:19:16 – Myths of how kids learn programming
  • 00:23:32 – More about projects
  • 00:27:57 – Video Course Spotlight
  • 00:29:27 – What other areas of Python do you want to explore?
  • 00:33:08 – Teachers using the site
  • 00:37:11 – What other resources have you used to learn Python?
  • 00:38:52 – What are you excited about in the world of Python?
  • 00:40:01 – What do you want to learn next?
  • 00:42:06 – Thanks and goodbye

Show Links:

Level up your Python skills with our expert-led courses:

Support the podcast & join our community of Pythonistas
