the bioinformatics chat

Roman Cheplyaka

A podcast about computational biology, bioinformatics, and next generation sequencing.

Top 10 the bioinformatics chat Episodes

Goodpods has curated a list of the 10 best episodes of the bioinformatics chat, ranked by the number of listens and likes each episode has garnered from our listeners. If you are listening to the bioinformatics chat for the first time, there's no better place to start than with one of these standout episodes. If you are a fan of the show, vote for your favorite episode by adding your comments to the episode page.

#14 Generating functions for read mapping with Guillaume Filion

11/13/17 • 70 min

Guillaume Filion recently published a preprint in which he applies generating functions, a concept from analytic combinatorics, to estimating the optimal seed length for read mapping.

In this episode, Guillaume and I attempt to explain the core concepts from analytic combinatorics and why they are useful in modeling sequences.

Links:

Guillaume’s preprint: Analytic combinatorics for bioinformatics I: seeding methods
Once upon a BLAST
Guillaume’s blog, «The Grand Locus»
Dan Gusfield’s home page featuring the fast Fourier transform lectures I mention in the podcast

After we recorded the podcast, Guillaume wrote to me to clarify the relationship between read mapping and BLAST:

I looked into my notes about BLAST. The problem that it solves is the following: “Given that a local alignment has score S, what is the probability that it does not contain a word of score T or greater”? The background work of Karlin and Altschul is used to give a statistical significance for S (what is the probability that a “Smith-Waterman random walk” starting at height 0 would reach height S, i.e. what is the probability that aligning two random proteins would yield a score S). The authors write in the original paper “Theory does not yet exist to calculate the probability q that such segment pair will contain a word pair with a score of at least T. However, one argument suggests that q should depend exponentially upon the score of the MSP”.

This is the part that I did not remember well. MSP stands for Maximal Segment Pair; this is the “longest fragment” with “highest score” in the alignment. I thought that Karlin and Altschul solved this part as well, but the authors just go empirical and calibrate the relationship between T and S with simulations.

I realize a little bit better now that my work is precisely about this problem that the authors of BLAST could not solve, but as you pointed out, I am attacking only a very specific sub-case that is much easier because the models of sequencing error are much simpler than protein evolution. BLAST is concerned with local alignment, so it wants to get all the hits with an MSP score above S. Short read mapping just wants the true location of the read, which does not really have the notion of a score S. But still, mathematically, it is equivalent to the case where S is a constant that depends only on the read size and the distribution of the score T depends only on the seed length and the error rate. I have a few ideas of how to use analytic combinatorics to solve the problem for proteins, but it is mostly complicated because the variable of interest T is a fractional number and not an integer...

So what is different from BLAST? The right answer (I think) is that BLAST finds all the hits with an MSP above statistical background, but it says nothing of the probability that the true location contains such an MSP, so it is hard to calibrate the heuristic for that specific problem. In reality, the parallel with BLAST is just the basic strategy: make a statistical model for your problem and use it to calibrate the heuristic.
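
To make the seeding question concrete, here is a minimal Python sketch of the quantity it turns on: the probability that a read contains at least one error-free stretch long enough to serve as a seed. It assumes independent errors at a uniform per-base rate, which is a simplification, and it uses a plain dynamic program rather than the generating-function machinery of the preprint. The function name and parameters are illustrative and not taken from Guillaume's work.

```python
def prob_error_free_seed(n, k, p):
    """Probability that n positions, each erroneous independently with
    probability p, contain a run of at least k consecutive correct ones.

    Assumed toy model: uniform, independent errors (not from the preprint).
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    q = 1.0 - p
    # state[j]: probability that no run of length k has occurred so far
    # and the current run of correct positions has length j (0 <= j < k).
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        for j, mass in enumerate(state):
            if mass == 0.0:
                continue
            new[0] += mass * p          # an error resets the run
            if j + 1 < k:
                new[j + 1] += mass * q  # run grows but is still too short
            # otherwise the run reaches k and the mass leaves these states
        state = new
    # Whatever mass is left never saw a run of length k.
    return 1.0 - sum(state)


if __name__ == "__main__":
    # Hypothetical read length, seed lengths, and error rate.
    for seed_length in (12, 20, 30):
        print(seed_length, prob_error_free_seed(100, seed_length, 0.02))
```

Sweeping the seed length for a fixed read length and error rate shows the trade-off that calibrating a seeding heuristic has to balance: longer seeds are more specific, but are less likely to survive sequencing errors.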

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#60 Differential gene expression and DESeq2 with Michael Love

05/12/21 • 91 min

In this episode, Michael Love joins us to talk about differential gene expression analysis of bulk RNA-Seq data.

We talk about the history of Mike’s own differential expression package, DESeq2, as well as other packages in this space, like edgeR and limma, and the theory they are based upon. Mike also shares his experience of being the author and maintainer of a popular bioinformatics package.

Links:

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 (Love, M.I., Huber, W. & Anders, S.)
DESeq2 on Bioconductor
Chan Zuckerberg Initiative: Ensuring Reproducible Transcriptomic Analysis with DESeq2 and tximeta

And a more comprehensive set of links from Mike himself:

limma, the original paper and limma-voom:
https://pubmed.ncbi.nlm.nih.gov/16646809/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053721/

edgeR papers:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378882/

The recent manuscript mentioned from the Kendziorski lab, which has a Gamma-Poisson hierarchical structure, although it does not in general reduce to the Negative Binomial:
https://doi.org/10.1101/2020.10.28.359901

We talk about robust steps for estimating the middle of the dispersion prior distribution; the references are Anders and Huber 2010 (DESeq), Eling et al 2018 (one of the BASiCS papers), and Phipson et al 2016:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218662/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6167088/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373812/

The Stan software:
https://mc-stan.org/

We talk about using publicly available data as a prior; the references I mention are the McCall et al paper using publicly available data to ask if a gene is expressed, and a new manuscript from my lab that compares splicing in a sample to GTEx as a reference panel:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013751/
https://doi.org/10.1101/856401

Regarding estimating the width of the dispersion prior, references are the Robinson and Smyth 2007 paper, McCarthy et al 2012 (edgeR), and Wu et al 2013 (DSS):
https://pubmed.ncbi.nlm.nih.gov/17881408/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378882/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3590927/

Schurch et al 2016, an RNA-seq dataset with many replicates, helpful for benchmarking:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/

Stephens paper on the false sign rate (ash):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379932/

Heavy-tailed distributions for effect sizes, Zhu et al 2018:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581436/

I credit Kevin Blighe and Alexander Toenges, who help to answer lots of DESeq2 questions on the support site:
https://www.biostars.org/u/41557/
https://www.biostars.org/u/25721/

The EOSS award, which has funded vizWithSCE by Kwame Forbes, and nullranges by Wancen Mu and Eric Davis:
https://chanzuckerb...

#48 Machine learning for drug development with Marinka Zitnik

07/29/20 • 85 min

In this episode, Jacob Schreiber interviews Marinka Zitnik about applications of machine learning to drug development. They begin their discussion with an overview of open research questions in the field, including limiting the search space of high-throughput testing methods, designing drugs entirely from scratch, predicting ways that existing drugs can be repurposed, and identifying likely side-effects of combining existing drugs in novel ways. Focusing on the last of these areas, they then discuss one of Marinka’s recent papers, Modeling polypharmacy side effects with graph convolutional networks.

Links:

Modeling polypharmacy side effects with graph convolutional networks (Marinka Zitnik, Monica Agrawal, Jure Leskovec)
Network Medicine Framework for Identifying Drug Repurposing Opportunities for COVID-19 (Deisy Morselli Gysi, Ítalo Do Valle, Marinka Zitnik, Asher Ameli, Xiao Gan, Onur Varol, Helia Sanchez, Rebecca Marlene Baron, Dina Ghiassian, Joseph Loscalzo, Albert-László Barabási)
AI Cures initiative

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#13 Bracken with Jennifer Lu

10/21/17 • 46 min

Jennifer Lu joins me to discuss species abundance estimation from metagenomic sequencing data.

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#66 AlphaFold and shape-mers with Janani Durairaj

07/10/23 • 20 min

This is the second episode in the AlphaFold series, originally recorded on February 14, 2022, with Janani Durairaj, a postdoctoral researcher at the University of Basel.

Janani talks about how she used shape-mers and topic modelling to discover classes of proteins assembled by AlphaFold 2 that were absent from the Protein Data Bank (PDB).

The bioinformatics discussion starts at 03:35.

Links:

A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L. Good, Roman A. Laskowski, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H. M. Rodrigues, Alistair S. Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex Bateman, Andrey V. Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani Durairaj, David B. Ascher, Janet M. Thornton, Norman E. Davey, Amelie Stein, Arne Elofsson, Tristan I. Croll & Pedro Beltrao)
The Protein Universe Atlas
What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds (Janani Durairaj, Andrew M. Waterhouse, Toomas Mets, Tetiana Brodiazhenko, Minhal Abdullah, Gabriel Studer, Mehmet Akdel, Antonina Andreeva, Alex Bateman, Tanel Tenson, Vasili Hauryliuk, Torsten Schwede, Joana Pereira)
Geometricus: Protein Structures as Shape-mers derived from Moment Invariants on GitHub
The group page
The Folded Weekly newsletter
A New York Times article about the Kramatorsk missile strike. The Instagram video, part of which you can hear at the beginning of the episode, appears to have been deleted.

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#65 AlphaFold and protein interactions with Pedro Beltrao

06/21/23 • 52 min

In this episode, originally recorded on February 9, 2022, Roman talks to Pedro Beltrao about AlphaFold, the software developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence.

Pedro is an associate professor at ETH Zurich and the coordinator of the structural biology community assessment of AlphaFold2 applications project, which involved over 30 scientists from different institutions.

Pedro talks about the origins of the project, its main findings, the importance of the confidence metric that AlphaFold assigns to its predictions, and Pedro’s own area of interest — predicting pockets in proteins and protein-protein interactions.

Links:

A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L. Good, Roman A. Laskowski, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H. M. Rodrigues, Alistair S. Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex Bateman, Andrey V. Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani Durairaj, David B. Ascher, Janet M. Thornton, Norman E. Davey, Amelie Stein, Arne Elofsson, Tristan I. Croll & Pedro Beltrao)
Pedro’s group at ETH Zurich

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#69 Suffix arrays in optimal compressed space and δ-SA with Tomasz Kociumaka and Dominik Kempa

09/29/23 • 56 min

Today on the podcast we have Tomasz Kociumaka and Dominik Kempa, the authors of the preprint Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space.

The suffix array is one of the foundational data structures in bioinformatics, serving as an index that allows fast substring searches in a large text. However, in its raw form, the suffix array occupies space proportional to (and several times larger than) the original text.

In their paper, Tomasz and Dominik construct a new index, δ-SA, which on the one hand can be used in the same way (it answers the same queries) as the suffix array and the inverse suffix array, and on the other hand occupies space roughly proportional to the gzip’ed text (or, more precisely, to the measure δ that they define, hence the name).

Moreover, they mathematically prove that this index is optimal, in the sense that any index that supports these queries — or even much weaker queries, such as simply accessing the i-th character of the text — cannot be significantly smaller (as a function of δ) than δ-SA.
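
For readers who have not met the data structure before, here is a minimal Python sketch of the classic, uncompressed suffix array, its inverse, and a binary-search substring query. It is not δ-SA and makes no attempt at compression. The construction is deliberately naive (it sorts whole suffixes), and all function names are illustrative assumptions.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order.

    Naive construction for illustration only; practical tools use
    specialised algorithms that avoid comparing whole suffixes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """isa[i] is the rank of the suffix starting at position i."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, found by binary search."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(inverse_suffix_array(sa))
    print(find_occurrences(text, sa, "ana"))
```

The array stores one integer per character of the text, which is where the "several times larger than the original text" cost comes from.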

Links:

Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space (Dominik Kempa, Tomasz Kociumaka)

Thank you to Jake Yeung and other Patreon members for supporting this episode.

#68 Phylogenetic inference from raw reads and Read2Tree with David Dylus

08/28/23 • 49 min

In this episode, David Dylus talks about Read2Tree, a tool that builds alignment matrices and phylogenetic trees from raw sequencing reads. By leveraging the database of orthologous genes called OMA, Read2Tree bypasses traditional, time-consuming steps such as genome assembly, annotation and all-versus-all sequence comparisons.

Links:

Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree (David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck, Christophe Dessimoz)
Background story
Read2Tree on GitHub
OMA browser
The Guardian’s podcast about Victoria Amelina and Volodymyr Vakulenko

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#67 AlphaFold and variant effect prediction with Amelie Stein

07/29/23 • 35 min

This is the third and final episode in the AlphaFold series, originally recorded on February 23, 2022, with Amelie Stein, now an associate professor at the University of Copenhagen.

In the episode, Amelie explains what ΔΔG is, how it informs us whether a particular protein mutation affects its stability, and how AlphaFold 2 helps in this analysis.

A note from Amelie:

Something that has happened in the meantime is the publication of methods that predict ΔΔG with ML methods, so much faster than Rosetta. One of them, RaSP, is from our group, while ddMut is from another subset of authors of the AF2 community assessment paper.

#15 Optimal transport for single-cell expression data with Geoffrey Schiebinger

11/26/17 • 68 min

Geoffrey Schiebinger explains how reconstructing developmental trajectories from single-cell RNA-seq data can be reduced to the mathematical problem called optimal transport.
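
As a rough illustration of what an optimal transport problem looks like, the Python sketch below couples two small discrete distributions with Sinkhorn iterations. Entropic regularisation is one common way to compute such a coupling and is an assumption here, not a description of the method discussed in the episode. The toy masses, costs, and function name are hypothetical.

```python
import math


def sinkhorn_plan(a, b, cost, eps=0.1, n_iter=500):
    """Entropy-regularised transport plan between distributions a and b.

    a, b  : lists of non-negative masses, each summing to 1
    cost  : cost[i][j] is the cost of moving mass from source i to target j
    eps   : regularisation strength (smaller gives a sharper plan)
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Toy example: two "cell states" at an early and a late time point,
    # with a hypothetical cost that favours staying in the same state.
    early = [0.5, 0.5]
    late = [0.5, 0.5]
    cost = [[0.0, 1.0], [1.0, 0.0]]
    for row in sinkhorn_plan(early, late, cost):
        print(row)
```

Each entry of the returned plan is the amount of mass moved from one source point to one target point. In a trajectory setting, the rows would be cells at one time point and the columns cells at the next.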

If you enjoyed this episode, please consider supporting the podcast on Patreon.

FAQ

How many episodes does the bioinformatics chat have?

the bioinformatics chat currently has 70 episodes available.

What topics does the bioinformatics chat cover?

The podcast is about Life Sciences, Genetics, Podcasts, Science, Bioinformatics and Biology.

What is the most popular episode on the bioinformatics chat?

The episode titled '#68 Phylogenetic inference from raw reads and Read2Tree with David Dylus' is the most popular.

What is the average episode length on the bioinformatics chat?

The average episode length on the bioinformatics chat is 63 minutes.

How often are episodes of the bioinformatics chat released?

Episodes of the bioinformatics chat are typically released every 28 days.

When was the first episode of the bioinformatics chat?

The first episode of the bioinformatics chat was released on Apr 16, 2017.
