
Docker + Python for Data Science and Machine Learning
05/08/20 • 55 min
Docker is a common tool for Python developers creating and deploying applications, but what do you need to know if you want to use Docker for data science and machine learning? What are the best practices if you want to start using containers for your scientific projects? This week we have Tania Allard on the show. She is a Sr. Developer Advocate at Microsoft focusing on Machine Learning, scientific computing, research and open source.
Tania has created a talk for PyCon US 2020, which is now online. The talk is titled “Docker and Python: Making them Play Nicely and Securely for Data Science and ML.” Her talk draws on her expertise in the improvement of processes, reproducibility and transparency in research and data science. We discuss a variety of tools for making your containers more secure and your results reproducible.
Tania is passionate about mentoring, open-source, and its community. She is an organizer for Mentored Sprints for Diverse Beginners, and she talks about the upcoming online sprints for PyCon US 2020. We also discuss her plans to start a podcast.
Topics:
- 00:00:00 – Introduction
- 00:01:43 – Microsoft Senior Developer Advocate Role
- 00:04:07 – PyCon 2020 Talk - Docker and Python: making them play nicely
- 00:05:34 – What is Docker?
- 00:10:08 – Reproducibility of project results
- 00:12:03 – What are the challenges of using Docker for machine learning?
- 00:15:06 – Getting started suggestions
- 00:16:26 – What metadata should be included?
- 00:17:48 – Creating images through stages
- 00:21:16 – What about your data?
- 00:22:40 – Kubernetes: Orchestrating containers
- 00:24:37 – Continuing stages into testing
- 00:25:37 – What are tools for testing security?
- 00:27:07 – Challenges in using containers for ML
- 00:28:52 – What types of databases?
- 00:29:39 – Are you doing initial research on a local machine?
- 00:30:59 – An example of a recent ML project
- 00:32:16 – Papermill: parameterizing and executing notebooks
- 00:33:16 – NLP: Natural Language Processing
- 00:33:58 – Kaggle: Help us better understand COVID-19
- 00:34:42 – What are other best practices for data intensive projects?
- 00:39:13 – Resources to get started in machine learning?
- 00:40:30 – Mentored Sprints for Diverse Beginners
- 00:45:34 – Tania’s upcoming podcast
- 00:48:38 – A visiting fellow at the Alan Turing Institute
- 00:49:08 – Weight lifting
- 00:50:16 – Craft beer
- 00:52:09 – What is something you thought you knew in Python but were wrong about?
- 00:53:50 – What are you excited about in the world of Python?
- 00:54:42 – Thank you and Goodbye
Show links:
- Tania Allard: Personal site
- Docker and Python: making them play nicely and securely for Data Science and ML - Tania Allard
- Slides for Docker and Python Talk
- Docker
- XKCD: Python Superfund Site
- Best practices for writing Dockerfiles
- Run Python Versions in Docker: How to Try the Latest Python Release
- Kubernetes: Production-Grade Container Orchestration
- Snyk: Securing open source and containers
- papermill: A tool for parameterizing and executing Jupyter Notebooks
- Natural Language Processing: Wikipedia article
- Natural Language Processing With spaCy in Python: Real Python article
- Kaggle: Help us better understand COVID-19
- datree.io: Scale Engineering organization
- repo2docker: Build, Run, and Push Docker Images from Source Code Repositories
- Jupyter Docker Stacks: A set of ready-to-run Docker images
- binder: Turn a Git Repo into a Collection of Interactive Notebooks
- Hands...
Previous Episode

AsyncIO + Music, Origins of Black, and Managing Python Releases
Want to learn more about AsyncIO in Python, with an example where you can see and hear events being triggered in real-time? This week we have Łukasz Langa on the show. Łukasz has created a talk for PyCon 2020 online about using AsyncIO with Music.
In his talk he shows live examples of coroutines, gathering, the event loop and events being triggered to create a piece of music. We also talk about his role as the release manager for Python 3.8 and 3.9. Łukasz provides background on the origins of his very popular, uncompromising code formatter, Black, and the types of problems it can solve inside of an organization.
Łukasz previously worked for Facebook, which is where he started Black. He talks about recently moving back to Poland. We discuss his current work for EdgeDB, building a new-generation object-relational database.
Topics:
- 00:00:00 – Introduction
- 00:01:32 – Łukasz’s background
- 00:03:22 – Leaving Facebook and moving back to Poland
- 00:05:26 – Starting work with EdgeDB
- 00:06:07 – What is EdgeDB?
- 00:12:28 – AsyncIO + Music PyCon 2020 talk
- 00:18:56 – More AsyncIO resources
- 00:23:36 – Comparing the event loop to a game loop
- 00:27:12 – Coroutines and gather
- 00:30:00 – A conversation with Glyph
- 00:33:40 – Bigger ideas for the AsyncIO MIDI sequencer
- 00:35:41 – Using uvloop as a replacement for the built-in reference AsyncIO loop
- 00:39:13 – Thoughts on MIDI 2.0
- 00:46:30 – Origins of Black
- 00:53:51 – Black grows in popularity
- 00:58:35 – What is involved in being the Python 3.9 release manager?
- 01:02:22 – The Python language summit
- 01:07:44 – Is the beta on schedule?
- 01:09:27 – How did you get the role of Release Manager?
- 01:15:09 – What are you excited about in the world of Python?
- 01:19:02 – If you were learning Python from scratch, what would you do differently?
- 01:22:18 – What is something you thought you knew about Python, but were wrong about?
- 01:26:05 – Goodbye and Thanks
Show links:
- Łukasz Langa - AsyncIO + Music - PyCon 2020
- Edge DB: The next generation database
- Edge DB YouTube Channel - Learn Python’s AsyncIO - Series
- PyCon 2020 Online Launch!
- code::dive 2017 – Łukasz Langa – Thinking in coroutines
- code::dive 2019 - Łukasz Langa - AsyncIO and Music - Earlier version
- John Carmack: “it’s time to start pushing forward on higher frame-rate, lower latency” - PCGamesN
- Glyph Lefkowitz: Wikipedia Article
- Orca: an esoteric programming language designed to quickly create procedural sequencer
- uvloop: an ultra fast implementation of the asyncio event loop
- Introducing MIDI 2.0 - Sound on Sound
- Polyend Tracker: Break the pattern
- YAPF: Python code formatter from Google
- Black: The uncompromising Python code formatter
- Łukasz Langa - Life Is Better Painted Black, or: How to Stop Worrying and Embrace Auto-Formatting - PyCon 2019
- The 2020 Python Language Summit
- Winterbloom: Synth Modules You Can Make Your Own
- Starlette: ✨ The little ASGI framework that shines. ✨
- CircuitPython
- ambv - Łukasz Langa’s GitHub
Next Episode

Leveling Up Your Python Literacy and Finding Python Projects to Study
In your quest to become a better developer, how do you find Python code that is at your reading level? What are good code bases or projects to study? What are the things holding you back from leveling up your Python literacy? This week we have Cecil Phillip on the show to discuss all of these common questions. Cecil is a Senior Cloud Advocate at Microsoft.
Cecil has been learning Python in the open on Twitch with Brian Clark. They run a weekly event on Twitch, where they are live-streaming an interactive Python course. Cecil has a background in multiple languages and technologies, and now he’s learning Python, bringing an audience along the way!
We start things off with a listener question and jump into a conversation about building up your Python skills. Then we’ll discuss common Python language stumbling blocks. Next we consider the importance of making personal projects, and documenting that code.
We also touch on some unique skills employers are looking for. And we discuss working through impostor syndrome. Cecil talks about his podcast “Away from the Keyboard” and his plans to start it back up.
In the show notes this week you’ll find links to resources we discuss, and several more that we didn’t have time to cover individually.
Want your question featured on the show? Send us your question at realpython.com/podcast-question and we might feature it on a future episode of the show.
Topics:
- 00:00:00 – Intro
- 00:01:52 – Cecil’s role at Microsoft
- 00:03:35 – Twitch Stream with Brian Clark
- 00:05:07 – Learning in front of an audience
- 00:13:05 – Listener’s question
- 00:14:46 – Finding code that’s at your level
- 00:20:31 – Understanding more complex syntax in Python
- 00:23:40 – Breaking down complexity
- 00:29:17 – Translation of code
- 00:31:55 – Importance of making projects and comments
- 00:36:28 – Finding community
- 00:41:23 – Open source contributing
- 00:42:25 – Dealing with impostor syndrome
- 00:49:09 – Looking for that first position
- 01:00:58 – More project resources in show notes
- 01:02:55 – Cecil’s podcast - Away from the keyboard
- 01:08:29 – What are you excited about in the world of Python?
- 01:10:14 – What is something you thought you knew about Python but were wrong about?
- 01:12:01 – What’s the next thing you want to learn in Python?
- 01:13:37 – Read the actual Python docs
- 01:15:24 – Thanks and goodbye
Show links:
- Microsoft Developer Channel
- Cecil Phillip’s Twitter
- Cecil’s Github
- Microsoft Developer Twitch
- Official Microsoft Python Discord
- Away from the Keyboard: Podcast
- Python Decorators 101: Real Python video course
- Python Type Checking: Real Python video course
- 13 Project Ideas for Intermediate Python Developers: Real Python article
Suggested project reading list:
- Flask: The Python micro framework for building web applications.
- Django: The Web framework for perfectionists with deadlines
- Howdoi: instant coding answers via the command line
- Curio: A coroutine-based library for concurrent Python systems programming
- scikit-learn: machine learning in Python
- SQLAlchemy: The Database Toolkit for Python
- Requests: A simple, yet elegant HTTP library
- Markupsafe: Safely add untrusted strings to HTML/XML markup
- Ask HN: Good Python codebases to read?
- The Hitchhiker’s Guide to Python: Reading Great Code
- Welcome! This is the documentation for Python 3.8