Author: Michelle Hallenbeck

I'm a PhD candidate at the University of Louisville in the lab of Dr. James Collins studying how sugar metabolism enhances the proliferation of Enterococcus faecium. My current research focuses on developing bioreactor and wax moth larvae models to characterize mutants in several key sugar metabolism genes. When not in the lab, I enjoy reading, writing, and spending time with my leopard gecko, Samantha. I started this blog to practice science writing by writing simplified protocols for the techniques I learn while doing research, and also to just talk about interesting science topics in general. Enjoy!

How to generate clean deletion mutants in Enterococcus

February 8, 2026 Michelle Hallenbeck

Greetings, fellow scientists! This article will walk through the method we use in my lab for knocking out genes in Enterococcus faecium and Enterococcus faecalis.

Briefly, we use a plasmid with a few key elements, including an antibiotic selection marker and a sequence consisting of a promoter, a CRISPR spacer, and the upstream (US) and downstream (DS) regions of the gene to be deleted, with an optional insert between the US and DS regions. The insert can be used to detect the presence of your mutant via qPCR if desired.

I’ll point out where you can adapt this method to your own bug where applicable.

Designing the plasmid

Our plasmid backbone includes the following elements:

An origin of replication (for replicating the plasmid in E. coli)
A tetracycline-inducible promoter (for triggering the process of swapping the gene of interest on the chromosome with the gene-less sequence on the plasmid)
An antibiotic selection marker (for maintaining the plasmid in E. coli or E. faecium/faecalis, e.g. gentamicin, erythromycin)

You may have access to a similar plasmid that you could use; as long as you have something that fulfills each of the above functions, this method should work for you.

The sequence is assembled (US region + insert + DS region) in SnapGene, then ordered from Twist Bioscience, which will synthesize it and send it to you. Alternatively, you could use SOE (splicing by overlap extension) PCR to generate the sequence yourself, as described here.

The sequence will consist of the following elements, in order from 5′ to 3′:

P3 promoter: CATTCTACAGTTTATTCTTGACATTGCACTGTCCCCCTGGTATAATAACTAT
First repeat: AATTTCTACTCTTGTAGAT
CRISPR target
- Identified by finding a region near the beginning of the gene of interest with three or more T nucleotides followed by one non-T nucleotide (V = A, C, or G; e.g. TTTV, TTTTV, TTTTTV)
- The 23 base pairs immediately following that sequence are the CRISPR target
Second repeat: AATTTCTACTCTTGTAGAT
US region for gene of interest (~500 bp)
qPCR-selectable marker (optional)
- To generate the qPCR marker, go to this website (https://faculty.ucr.edu/~mmaduro/random.htm), set the size to 90 bp and the GC content to 0.5.
DS region for gene of interest (~500 bp)
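If you'd rather not hunt for the CRISPR target by eye, here is a rough Python sketch of the search described above: it looks for a run of three or more Ts followed by A, C, or G and reports the 23 bases after it, then joins the elements in the order listed. The function names, the 300 bp search window, and the choice to scan only the strand you give it are my own assumptions for illustration, so double check any candidate in SnapGene before ordering.

```python
import re

P3_PROMOTER = "CATTCTACAGTTTATTCTTGACATTGCACTGTCCCCCTGGTATAATAACTAT"
REPEAT = "AATTTCTACTCTTGTAGAT"


def find_targets(gene_seq, target_len=23, search_window=300):
    """List (pam_start, pam, target) for TTTV-type sites near the gene start.

    pam_start is a 0-based index into gene_seq. Only the strand given is
    scanned, and search_window (bp) is an arbitrary illustrative cutoff.
    """
    seq = gene_seq.upper()
    hits = []
    # (?<!T) anchors each match at the start of a T run, so a longer run
    # such as TTTTTV is reported once rather than once per overlapping TTTV.
    for match in re.finditer(r"(?<!T)T{3,}[ACG]", seq):
        if match.start() >= search_window:
            break
        target = seq[match.end():match.end() + target_len]
        if len(target) == target_len:  # skip sites too close to the end
            hits.append((match.start(), match.group(), target))
    return hits


def assemble_insert(target, us_region, ds_region, qpcr_marker=""):
    """Join the elements in the 5' to 3' order listed above."""
    return (P3_PROMOTER + REPEAT + target + REPEAT
            + us_region + qpcr_marker + ds_region)


# Toy sequence, not a real gene.
toy_gene = "ATG" + "TTTA" + "ACGT" * 6
for pam_start, pam, target in find_targets(toy_gene):
    print(pam_start, pam, target)
```

The plasmid-matching tags for the 5′ and 3′ ends are not included here; they depend on your backbone.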

To complete this sequence, tags must be added to the 5′ and 3′ ends which are complementary to the ends of your plasmid backbone. This will make it possible to generate the overlapping ends necessary for insertion into the plasmid backbone.

Designing all necessary primers

To design primers to check for clean deletion of your gene at the end of this process:

Select a region of the genome encompassing >500 bp upstream and downstream of the US and DS arms of the gene of interest and copy and paste into Primer3 (https://www.primer3plus.com/)
To force Primer3 to pick primers that fall on either side of the US and DS arms, specify the locations of the left and right primer as follows (https://www.primer3plus.com/primer3plusHelp.html#SEQUENCE_PRIMER_PAIR_OK_REGION_LIST)
- For example, if the regions on either side of the US and DS arms are 500 bp long, and the middle region (including the US and DS arms and the gene of interest) is 1500 bp long, then you would enter the following into the ‘Pair OK Region List’ box: 1, 500, 2001, 500
Check primers against the genome and the insert to double check they hit only on the genome and not on the insert
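The numbers for the ‘Pair OK Region List’ box are just running totals of the region lengths, so a small Python helper can work them out for you. This is only a sketch: the function name is made up, and it assumes your pasted sequence is laid out as left flank, middle region, right flank, with positions counted from 1 as in the example above.

```python
def pair_ok_region_list(left_flank_len, middle_len, right_flank_len, first_base=1):
    """Build the 'left_start,left_len,right_start,right_len' string.

    Assumes the pasted sequence is laid out as left flank, then the middle
    region (US arm + gene + DS arm), then right flank, and that positions
    are counted from first_base.
    """
    left_start = first_base
    right_start = first_base + left_flank_len + middle_len
    return f"{left_start},{left_flank_len},{right_start},{right_flank_len}"


print(pair_ok_region_list(500, 1500, 500))  # 1,500,2001,500
```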

To design primers for selection of your mutant via qPCR, you need only copy and paste the randomly generated marker into IDT’s PrimerQuest tool (https://www.idtdna.com/PrimerQuest/Home/Index)

Your primers should arrive in lyophilized (dry powder) form. After you have your primers:

Resuspend lyophilized insert to 50 ng/ul in low EDTA buffer (so that it doesn’t chelate all the metal ions in the PCR and prevent the PCR from working)
- Elution buffer from Zyppy plasmid prep kit works for this purpose
Spin down resuspended primers for ~7 seconds (hold down button) in microcentrifuge
Resuspend in water to 100 uM (approximate this by looking at the nmol number, moving the decimal point one place to the right, then adding that amount of water in uL) to make stocks
Add 10 ul of this to 90 uL water to make working stocks to be used in PCR
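The decimal-point trick works because a 100 uM stock needs 10 uL of water per nmol of primer. If you want to check the arithmetic (or use a different concentration), here is a minimal Python sketch; the function names and the 25.3 nmol example tube are made up for illustration.

```python
def stock_water_ul(nmol, stock_um=100.0):
    """Water (uL) needed to dissolve `nmol` of primer at `stock_um` micromolar."""
    return nmol * 1000.0 / stock_um


def diluted_conc_um(stock_um, stock_ul, water_ul):
    """Concentration after mixing stock with water (C1 * V1 = C2 * V2)."""
    return stock_um * stock_ul / (stock_ul + water_ul)


print(round(stock_water_ul(25.3), 1))   # water in uL for a 25.3 nmol tube
print(diluted_conc_um(100, 10, 90))     # working stock concentration in uM
```

The second call confirms that 10 uL of a 100 uM stock in 90 uL of water gives the 10 uM working stock used in the PCR recipes.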

PCR to remove tags from ends of sequence and create overlapping ends with the linearized plasmid backbone

Reaction Setup:

Component	20 µl Reaction
VWR Fast HiFi DNA Polymerase 2X Master Mix	10 µl
10 µM primer 1	1 µl
10 µM primer 2	1 µl
Template- insert	0.5 µl
Nuclease-Free Water	7.5 ul (to 20 µl)

Thermocycling Conditions:

STEP	TEMP	TIME
Initial Denaturation	98°C	30 seconds
25 Cycles	98°C	8 seconds
	65°C	20 seconds
	72°C	1 minute
Final Extension	72°C	4 minutes
Cooldown	22°C	22 seconds

Check results by running PCR product on 1% agarose gel at 110 V for 30 mins

PCR to linearize plasmid backbone

Protocol for Q5® High-Fidelity 2X Master Mix

Reaction Setup:

Component	50 µl Reaction
Q5 or VWR Fast HiFi DNA Polymerase 2X Master Mix	25 µl
10 µM primer 1	2.5 µl
10 µM primer 2	2.5 µl
Template- circular plasmid	1 µl
Nuclease-Free Water	19 ul (to 50 µl)

First dilute plasmid to 5% (1 µl plasmid + 19 µl nuclease-free water)

Thermocycling Conditions:

STEP	TEMP	TIME
Initial Denaturation	98°C	30 seconds
19 Cycles	98°C	7 seconds
	72°C	3 minutes 50 seconds + 8 seconds/cycle
Final Extension	72°C	2 minutes
Hold	22°C

DpnI treatment (to digest methylated DNA, which should target only the circular plasmid backbone and not the linearized backbone generated during PCR). In a PCR tube:

20 ul DNA

1 ul DpnI (add last)

5 ul rCutSmart buffer

24 ul water

Incubate at 37°C for 60 mins, 80°C for 20 mins

Run backbone on 0.8% gel (0.32 g agarose + 40 mL 1x LAB buffer) at 110 V for 40 mins

A lower percentage gel is better for larger sequences, as they can move farther down the gel and are thus more easily distinguished.
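Gel percentage here is weight per volume (grams of agarose per 100 mL of buffer), so the amount to weigh out for any gel size is a one-line calculation. A tiny Python sketch, with a made-up function name:

```python
def agarose_grams(percent_w_v, buffer_ml):
    """Grams of agarose for a gel of the given % (w/v) and buffer volume."""
    return percent_w_v * buffer_ml / 100.0


print(round(agarose_grams(0.8, 40), 2))  # the 0.8% backbone gel
print(round(agarose_grams(1.0, 40), 2))  # a 1% gel of the same size
```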

Assemble plasmid using in vivo cloning

Use NEB’s ligation calculator (https://nebiocalculator.neb.com/#!/ligation) to determine the amounts of insert and vector (backbone) required
- Enter vector mass as 100-150 ng
- The vector:insert molar ratio should be 1:2
Once insert and vector volumes have been calculated, add them to a PCR tube
- If you have access to NEBuilder HiFi DNA assembly master mix:
  - Add deionized water to bring the volume to 10 uL and then add 10 uL of the mix
  - Mix thoroughly by pipetting and then incubate in a thermocycler for 15 minutes at 50°C
  - Following incubation, store samples on ice or at -20°C for subsequent transformation.
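If you are curious what the calculator is doing, the conversion between a molar ratio and a mass is simple: a shorter fragment weighs proportionally less per molecule, so the insert mass is the vector mass scaled by the length ratio and by the number of inserts you want per vector. Here is a minimal Python sketch of that arithmetic. The function names, fragment lengths, and concentration below are hypothetical examples, not values from our plasmid, and I would still use the NEB calculator for the real thing.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, insert_per_vector=2.0):
    """Mass of insert giving `insert_per_vector` insert molecules per vector."""
    return vector_ng * (insert_bp / vector_bp) * insert_per_vector


def volume_ul(mass_ng, conc_ng_per_ul):
    """Volume that contains `mass_ng` at the given concentration."""
    return mass_ng / conc_ng_per_ul


# Hypothetical example: 100 ng of a 6,000 bp backbone, a 1,200 bp insert,
# and insert stock at 50 ng/uL.
mass = insert_mass_ng(100, 6000, 1200)
print(round(mass, 1), "ng insert")
print(round(volume_ul(mass, 50), 2), "uL insert")
```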

Transformation Protocol:

Pre-warm 3 LB plates containing your selection antibiotic
Thaw three tubes of chemically competent E. coli cells on ice.
- One for assembled product, one with digested backbone as negative control, and one with circular backbone as positive control

Add 2 ul of assembled product to the cell mixture. Carefully flick the tube 4-5 times to mix cells and DNA. Do not vortex.
Place the mixture on ice for 30 minutes. Do not mix.
Heat shock at exactly 42°C for exactly 30 seconds. Do not mix.
Place on ice for 5 minutes. Do not mix.
Pipette 950 µl of room temperature NEB 10-beta/Stable Outgrowth Medium into the mixture.
Place at 37°C for 60 minutes. Shake vigorously (250 rpm) or rotate.
Warm selective plates to 37°C.
Centrifuge the cells at 2,000 rpm for 2 minutes.
Discard all but ~100 µl supernatant and resuspend pellet.
Spread 100 µl of the resuspended pellet onto a selective plate (LB gent) and incubate overnight at 37°C.

Colony PCR to confirm presence of assembled plasmid

Use a pipette tip to patch colonies onto fresh plate before adding to PCR tube.

10 ul reaction:

5 ul Taq

0.2 ul primer 1

0.2 ul primer 2

4.6 ul water

Thermocycler conditions:

STEP	TEMP	TIME
Initial Denaturation	95°C	30 seconds
30 Cycles	95°C	15 seconds
	43°C	15 seconds
	68°C	1 minute 30 seconds
Final Extension	68°C	5 minutes
Cooldown	22°C	22 seconds

Run 1% gel at 110 V for 30 mins

If you have successfully assembled your plasmid:

Set up overnight culture for plasmid prep and freezer stock
- 2XYT broth + 20 ug/ml gent
Plasmid prep
- Recommended to have your plasmid sequenced to double check that it has been assembled correctly
Check plasmid concentration via Nanodrop
To make freezer stock of E. coli containing your plasmid: mix 1 mL culture + 100 uL DMSO

Electroporate plasmid into Enterococcus

First you need to make electrocompetent E. faecium or E. faecalis:

First Day-

First thing in the morning – inoculate one single colony in 5 ml of BHI and incubate during the day at 37°C/250 rpm.
At the end of the day, transfer 125 µl of the 5 ml BHI culture to 5 ml BHI 300 mM sucrose + 1% or 2% glycine.
- 1% glycine
  - 250 ul 20% glycine
  - 2.375 ml 2X BHI and 600 mM sucrose
- 2% glycine
  - 500 ul 20% glycine
  - 2.25 mL 2X BHI and 600 mM sucrose
Incubate the culture overnight at 37°C (no agitation!).

Second Day-

Centrifuge the culture for 10 minutes at 3,700 rpm at 4°C.
Add 3 ml of PSM buffer to the pellet and mix carefully until the pellet is resuspended.
Repeat this procedure twice.
Add 500 μl of PSM (+100 µl 85% glycerol) to the pellet for final suspension.
Make 50 µl aliquots to store at -80°C or use immediately.

Electroporation Protocol:

Mix a 50 µl aliquot of the cell suspension with 1 µg of plasmid (≤1 µl per 50 µl cells if elution buffer was used; ≤5 µl per 50 µl cells if water was used). Keep the mixture and an ice-cold electroporation cuvette (2-mm gap) on ice for 20 minutes before electroporation.
- Use original circular backbone as positive control
- Chill pipettes and pipette tips in -20°C during this incubation period
Transfer the mixture into the ice-cold electroporation cuvette (2-mm gap) and transform cells by electroporation at 200Ω, 25µF, 1.25kV. Time constant should be 2-4.
Immediately after electroporation, add 1 mL ice-cold BHI 300 mM sucrose broth and incubate the cells at 37°C for 2.5 hours.
Spin at 3000 RPM in bench top centrifuge for 2 min and discard all but ~100 µl supernatant.
Gently resuspend the pellet and plate all the cells on a pre-warmed selection plate.
Incubate the agar plate overnight at 37°C or at RT over weekend

Colony PCR to check for presence of plasmid

Use a pipette tip to patch colonies onto fresh plate before adding to PCR tube

10 uL reaction:

5 ul Taq

0.2 ul primer 1

0.2 ul primer 2

4.6 ul water

Thermocycler conditions:

STEP	TEMP	TIME
Initial Denaturation	95°C	30 seconds
30 Cycles	95°C	15 seconds
	43°C	15 seconds
	68°C	1 minute 30 seconds
Final Extension	68°C	5 minutes
Cooldown	22°C	22 seconds

Run 1% gel at 110 V for 30 mins

If you have successfully electroporated your plasmid into your strain, streak a colony containing the plasmid onto a plate containing both the selection antibiotic and anhydrotetracycline (aTc). The aTc activates the tetracycline-inducible promoter, which swaps out the gene of interest for the gene-less sequence on the plasmid.

Colony PCR to check for clean deletion

Thermocycling conditions:

STEP	TEMP	TIME
Initial Denaturation	95°C	30 seconds
30 Cycles	95°C	15 seconds
	Use NEB Tm calculator to determine annealing temp of primers*	15 seconds
	68°C	3 minutes
Final Extension	68°C	5 minutes
Cooldown	22°C	22 seconds

* https://tmcalculator.neb.com/#!/main

Run 1% gel at 110 V for 30 mins

Once the deletion is confirmed, patch an appropriate colony onto a plate without the antibiotic so that the mutant will lose the plasmid.

The colony PCR to check for plasmid loss is the same as to check for the presence of the plasmid, but this time you obviously do not want to see a band on your gel.

It may be necessary to patch multiple times and/or subculture a colony overnight and streak it out again.

Confirm presence of qPCR-selectable marker (if applicable)

Thermocycler conditions:

STEP	TEMP	TIME
Initial Denaturation	95°C	30 seconds
30 Cycles	95°C	15 seconds
	62°C	15 seconds
	68°C	1 minute
Final Extension	68°C	5 minutes
Cooldown	22°C	22 seconds

Run 1% gel at 110 V for 30 mins

Once plasmid loss, clean deletion, and presence of qPCR marker have all been confirmed, you should make a freezer stock of your mutant before using it in any downstream experiments.

And that’s it! Nice and easy, right? Feel free to comment on this article if you have any questions about this method.

A bug with a sweet tooth: the life and times of vancomycin-resistant Enterococcus faecium

August 25, 2024 (updated August 27, 2024) Michelle Hallenbeck

*looks at calendar* Wow, it’s been a while since the last time I was here. I have a good excuse- I’ve been busy working on my PhD! I just started my fifth year, so although I still have a little ways to go, I’m finally about to enter the home stretch. Naturally, I’ve started thinking in earnest about what I want to do afterwards, and I’ve come to realize two things: 1) 99% of research is desperately trying to figure out why nothing is working and gradually realizing that no one, not even your PI, has the answers, and 2) doing science isn’t nearly as fun as thinking and writing about it. So I’m going to play to my strengths (and save my sanity in the process) by pursuing a career in science writing. To that end, I’ve decided to revive this blog and use it to regularly write layperson-friendly articles about interesting scientific topics. The obvious place to start is with my own research project, so without further ado, I’m going to tell you a story about a microbe I’ve become very well acquainted with over the past few years: Enterococcus faecium.

The star of the show

Just like there are many different types of people, there are many different types of E. faecium. There’s the friendly commensal kind who live in your gut and are a normal part of a healthy gut microbiome, which helps you digest food and protects you from pathogens. And then there’s the deadly antibiotic resistant kind, which are commonly found in hospitals and cause infections in high-risk patients who are being treated with antibiotics.

The above graphic is from the CDC, and it shows just how deadly VRE can be. Hospitalized patients who are being treated with antibiotics are at the greatest risk of developing a VRE infection, because the antibiotics will kill off gut microbes in addition to their intended target, freeing up prime bacterial real estate. Without the protective barrier of the gut microbiota, VRE is free to grow and spread in the now much less populated gut. Sometimes it grows so much that there’s not enough room in the gut anymore and it spreads to the bloodstream- this is called septicemia, and it’s a very, very bad thing to have. Once bacteria get into your blood, it’s extremely difficult to get rid of them; hence why so many people die from VRE, and why I’m trying to find ways to prevent VRE from getting to that point.

Many bacteria, few resources

All bacteria need to consume nutrients in order to grow. Which bacteria become most abundant in a microbial community depends on the abundance of different nutrients and on each species’ ability to use them. This is called the nutrient niche hypothesis.

In a struggle for survival between the gut microbiota and VRE, the microbiota will win- until their population is decimated by antibiotics. In the aftermath, as the survivors struggle to rebuild, VRE comes in and starts consuming resources… and in the modern Western diet, there are many, many different resources that VRE is particularly suited to take advantage of.

Namely, sugars. Lots and lots of sugars.

Sugar: it’s bad for us, but great for VRE

Sugar consumption has increased by quite a lot over the past hundred years or so, both in terms of overall amounts and the variety of sugars being consumed. Another thing that has increased over the same time span is the number of different sugar transporters and sugar metabolism genes that VRE has. We don’t think this is a coincidence: we think that VRE has evolved to take advantage of increasing sugar availability, and that their ability to metabolize many different sugars gives them a competitive advantage over the gut microbiota. They devour the available nutrients before the microbiota has a chance to fully recover from the antibiotics, expanding quickly to dominate the gut and eventually spread to the blood.

The focus of my research is identifying new targets we can use to combat VRE, to reduce their population in the gut before they can spread to the bloodstream. In particular, I’m looking at their sugar metabolism genes to see whether any of them are useful targets. The idea is that it may be possible to prevent VRE from growing and taking over the gut by either blocking their ability to take in certain sugars or by blocking a particular step in the process of breaking down a sugar after VRE has taken it up, the latter of which will lead to toxic effects on the bacteria due to intermediate products piling up.

So, that’s a broad overview of what I’m doing. I’ll be back with more posts about some of the specific things I’ve been doing in the lab. I’m looking forward to sharing them with you!

Update: check out this blog post about my latest paper!

Anvi’o v5.1: Functional Enrichment Analysis and Computing ANI

July 15, 2018 (updated October 23, 2023) Michelle Hallenbeck

Hello again, everyone! Since I wrote my last blog post, there has been a new version of anvi’o released with many new and useful capabilities. Two of these I found most helpful in my own work, so I will talk about them here. Both of these apply to pangenomes, so you must have already run anvi-pan-genome before doing anything here. See my original blog post for the pangenomic workflow.

Average nucleotide identity (ANI)

The new command anvi-compute-ani calculates average nucleotide identity (ANI) of the genomes in your pangenome. ANI is essentially a measure of how similar two genomes are at the nucleotide level. The higher an ANI between two genomes, the more closely related they are. To calculate ANI for the genomes in your pangenome, run the following in your terminal:

anvi-compute-ani --external-genomes /home/mkh/external_genomes.txt --output-dir /home/mkh/ANI/ --num-threads 20 --pan-db /home/mkh/Thiomonas_isolate_pan/THIOMONAS_ISOLATES-PAN.db

--external-genomes: You must provide the external genomes text file you used when you created your genomes storage with anvi-gen-genomes-storage.

--output-dir: The directory (folder) ANI will be created for you, so you must specify the file path where you want anvi’o to place the new directory.

You are not actually required to provide the pan database you created with anvi-pan-genome, but if you choose to include the --pan-db flag, you can add the results of the ANI computation to the pan database as additional layer data. If you visualize the pangenome again (with anvi-display-pan) after running anvi-compute-ani, you will not see anything different at first. You will have to go to the ‘layers’ tab and check the box for ‘ANI_percentage_identity’. Then click the ‘Draw’ button, and you will see something like this:

[Figure: ani_isolate_pan]

You may have to go to ‘Order by’ and select ‘ANI_percentage_identity (tree)’ to see your ANI data in an order that helps you visualize how closely related your genomes are (remember to click ‘Draw’ again after this).

When you summarize your pangenome with anvi-summarize, one of the output files you will generate will be a text file, ANI_percentage_identity.txt, that contains the numerical information that is displayed in the graph above.

Functional enrichment analysis

Anvi’o’s new program anvi-get-enriched-functions-per-pan-group allows you to determine which functions are characteristic of any given group of your genomes relative to the rest. Essentially, you divide your genomes into two or more groups (with anvi-import-misc-data) and use anvi-get-enriched-functions-per-pan-group to determine which functions are enriched for each group relative to all the rest combined. This can be done using any of the functional annotation sources you have available, and can be run more than once if you want to do the analysis using more than one annotation source.

To assign your genomes to groups, run anvi-import-misc-data:

anvi-import-misc-data TAB-DELIMITED-FILE -p /home/mkh/Thiomonas_isolate_pan/THIOMONAS_ISOLATES-PAN.db -t layers

-t layers: see here

The tab-delimited text file you must provide at this step contains the information on which groups you are assigning your genomes to:

[Figure: misc]

You may assign them to groups based on taxonomic classification, phylogenetic tree groupings, isolation sites, or anything you like, depending on the question(s) you are trying to answer. When you visualize the pangenome again after this, it should become part of your layers data that appears in the top right corner of the figure.
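You can type the groups file by hand in a text editor, but if you have a lot of genomes it may be easier to generate it. Below is a minimal Python sketch that builds the tab-delimited text with one grouping column called ‘cluster’. The genome names are hypothetical, and the first column header (‘genome’) is a placeholder I picked; check the anvi’o documentation for what it expects there, and make sure the names match the ones in your genomes storage.

```python
def groups_table(groups, column="cluster", first_header="genome"):
    """Return tab-delimited text: a header row, then one row per genome.

    `first_header` is a placeholder name for the first column.
    """
    lines = [f"{first_header}\t{column}"]
    for genome, group in groups.items():
        lines.append(f"{genome}\t{group}")
    return "\n".join(lines) + "\n"


# Hypothetical genome names; use the names from your own genomes storage.
example = groups_table({"Genome_A": "I", "Genome_B": "I", "Genome_C": "II"})
print(example, end="")
```

Save the text to a file and give that file's path to anvi-import-misc-data in place of TAB-DELIMITED-FILE.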

Next, we run anvi-get-enriched-functions-per-pan-group:

anvi-get-enriched-functions-per-pan-group -p /home/mkh/Thiomonas_isolate_pan/THIOMONAS_ISOLATES-PAN.db -g /home/mkh/THIOMONAS-GENOMES.db --category-variable cluster --annotation-source COG_FUNCTION -o /home/mkh/Thiomonas_enriched_functions/Thiomonas_COG_functions_cluster.txt --functional-occurrence-table-output /home/mkh/Thiomonas_enriched_functions/Thiomonas_COG_occurrence.txt

You need to provide the pan database and the genomes storage with -p and -g respectively (remember to include the file path).

--category-variable specifies the category of groups to be used in this analysis, which is simply the column header in the text file you provided to anvi-import-misc-data. In this example, ‘--category-variable cluster’ specifies the ‘cluster’ column in the text file, which divides the genomes into groups I and II. Running the command as written above would conduct the functional enrichment analysis for genomes in group I vs group II.

-o /home/mkh/Thiomonas_enriched_functions/Thiomonas_COG_functions_cluster.txt will generate the text file to which the results of the analysis will go. There are 14 columns in the text file: category, COG_FUNCTION (or whichever annotation source you specified), enrichment_score, weighted_enrichment_score, portion_occurrence_in_group, portion_occurrence_outside_of_group, occurrence_in_group, occurrence_outside_of_group, gene_cluster_ids, core_in_group, core, wilcoxon_p_value, wilcoxon_statistic, wilcoxon_corrected_p_value. The meaning of each column is described here (scroll down).

--functional-occurrence-table-output is an optional text file output that is simply a presence/absence table of each function in each genome. I highly recommend doing this, as it is very useful information to have. There are instructions here for visualizing the functional pangenome using the functional occurrence table, but it’s actually much more difficult than it looks and I never made it past the fourth step, for technical reasons that are beyond my comprehension. You are welcome to attempt it if you wish, and please let me know how you achieved it, so that I can post it here and actually make it clear what you’re supposed to do and how you’re supposed to do it.

If you have any questions regarding anything in this tutorial, please feel free to let me know!

How to Conduct a Pangenome Analysis using Anvi’o

May 23, 2018 (updated June 4, 2018) Michelle Hallenbeck

Introduction

If you’re reading this right now, I’m going to assume that you know what a pangenome is, but you don’t know the first thing about how to analyze one. Lucky for you, I had to do that very thing this past spring as part of the research leading up to my senior thesis. Having little to no bioinformatics background going in, I found the process to be long and laborious, and often found myself screaming obscenities at my computer screen. Out of the goodness of my heart, I have decided to save you, dear reader, many months of strife and anguish by writing this tutorial.

I will often refer to this tutorial written by the anvi’o programmers, which provides a good framework for the steps you need to take but makes the assumption that you know something about computer science and bioinformatics, and therefore skips over steps that I was completely unaware of. I will not make that assumption, and at times you may wonder why I take the time to explain something that seems to be as basic as 2+2=4. However, to the untrained (in computer science) eye, the most basic of concepts may not be immediately obvious. My experience has taught me that if every single minute step is not explicitly stated, I will miss something that is vitally important and a simple task that should have taken no more than an hour will take at least three.

I hope that you find this tutorial to be complete and helpful. If you have any questions, comments, or concerns, please feel free to let me know!

Terminal

You will be doing everything in Terminal, which can be found on a Mac by pressing the F4 key (the one with six little squares arranged in 2 rows of 3), then clicking on ‘Other’. Terminal will be the black rectangle with ‘>_’ in white in the upper left corner. When you click on it, a window will pop up that looks like this:

Basic commands

Using terminal will require you to master some basic commands, including but not limited to: ls, cd, mv, cp, mkdir, and rm. I found this site very helpful, especially section 5: Manipulating Files. You should pay close attention to the ‘Wildcards’ section, which will be very helpful to you later when you are running ‘for loops’.

A note on terminology: ‘directory’ basically just means ‘folder’, and a directory can contain one or more files. You will be working with directories and files frequently throughout this tutorial, so it’s essential that you know how to use the relevant commands. If you find something in the above link to be insufficiently explained, let me know and I will restate it here in a more explicit fashion.

Installing anvi’o

There are many different ways to install anvi’o on your computer, instructions for all of which can be found here. I found the first section, ‘Painless installation with Homebrew’ to be easiest, so I recommend installing anvi’o that way. You should also make sure that you have Python 3 installed on your computer before you try to use anvi’o.

This was relatively simple to do on my own computer (which, by the way, is a Mac, so if you are not using a Mac, some things in this tutorial may or may not be different for you), but if you want to install anvi’o on biomix, the installation process will be long and agonizing (at least it was for me). The Homebrew installation will not work, so you will have to follow the instructions under ‘Installing the latest stable release (safe mode)’. The latest version of anvi’o at the time of this writing is v4.

If you don’t know what biomix is or have never heard of it, you can just ignore every mention of it in this tutorial. If you do know what biomix is, then you probably work at DBI, so until I graduate in the spring of 2019 you can just come find me if you have any questions about the biomix-specific portions of this tutorial (I work in the Chan lab on the first floor, and when I’m at DBI you can usually find me in the lab or in the bullpen).

Remember to run ‘anvi-self-test --suite mini’ and ‘anvi-self-test --suite pangenomics’ to make sure you have all your ducks in a row (in case you can’t tell, there are TWO dashes before the word ‘suite’ in both commands). Biomix people: when you run ‘anvi-self-test --suite pangenomics’, you have to stop the test once it gets to the part where it tries to display the pangenome (just type ‘exit’ or ‘end’, I don’t remember which).

Preparing your genomes for analysis

If the installation process has left you exhausted and with a permanent grudge against computers, don’t look for a respite here. You still have a long way to go before you can even conduct the actual pangenome analysis. There are several steps you have to take to get your genomes in the right format for anvi’o to be able to run a pangenome analysis on them.

External vs. internal genomes

The first thing we need to do is differentiate between external and internal genomes. External genomes are just the FASTA files of your genomes that you can download from NCBI or IMG or wherever. Internal genomes are what you get after you have been through anvi’o’s metagenomics workflow. If you have internal genomes, that means you have already been through that workflow and you already know more about how to use anvi’o than I do, so I’m not entirely sure what you’re doing reading a tutorial for beginners. I used only external genomes when I did this, so if you want to do this with internal genomes or a combination of internal and external genomes, I will point you to the not-for-beginners anvi’o pangenomics tutorial and what they say about that.

If you have external genomes, make sure that the file names all end in the same extension (might be .fna, .fa, or .faa, but make sure they all have the same one). Biomix people: you will need to copy your genomes onto biomix by using the command scp in the terminal window on your own computer. Example:

scp -r /Users/michellehallenbeck/Desktop/Thiomonas_genomes/ mkh@biomix.dbi.udel.edu:/home/mkh/Thiomonas_genomes/

You will need to provide the file path of your genomes folder, which is basically just where your genomes are located on your computer.

The -r parameter means ‘recursive’, and when you run this it will recursively copy everything in the folder you specified. It will prompt you for your biomix password.

Check your genomes for completion

The next step is to run the program CheckM on your genomes to make sure they are sufficiently complete. This is not required, but I highly recommend it nonetheless, because there’s not much point in running a pangenome analysis on genomes that are only 20-30% complete. I used a cutoff of 70% complete for my analysis, but this is a purely arbitrary boundary and not a particularly strict one. You can choose whatever minimum percent completion makes sense to you.

You will obviously have to download CheckM in order to run it (unless you are using biomix, which already has CheckM installed). You can do that from the link above.

Note that CheckM assumes that your files all end in the extension .fna, so if your genome files don’t have that extension you will have to specify whichever extension it is you are using by typing -x followed by the three-letter extension (without the period before it).

Biomix people: you will need to start an interactive job before running CheckM (or before running anything else, for that matter). See ‘Interactive SLURM job’ under the link in the previous sentence for more details on the different parameters. This is what I ran every time I needed to start an interactive job:

srun -N 1 -c 12 --mem=316000 --partition=batch --pty bash

To run CheckM on your genomes:

checkm lineage_wf Thiomonas_genomes/ Thiomonas_CheckM_output/

‘Thiomonas_genomes/’ represents the folder containing your FASTA files, and ‘Thiomonas_CheckM_output/’ represents the folder in which CheckM will place the results you got from running it. CheckM will create this folder for you, so don’t make it yourself before running CheckM. You will have to tell CheckM the name of the folder you want it to place its output in.

Once you have run CheckM and know how complete your genomes are, you might decide that some of them are not complete enough for your liking. I found it easiest to create a new folder within my genomes folder and move the not-complete-enough genomes into that folder using the following set of commands:

cd Thiomonas_genomes/

mkdir not_used/

mv file_1 file_2 not_used/

The first of those commands allows you to enter your genomes folder, the second creates the new folder that you want to move your incomplete genomes to, and the third actually moves those genomes into the new folder.
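If you have a lot of genomes, picking out the incomplete ones by eye gets old fast. Here is a minimal sketch of one way to list them, assuming you have CheckM's results as a tab-separated table whose header row includes columns named ‘Bin Id’ and ‘Completeness’. I believe adding --tab_table -f followed by a file name to the lineage_wf command produces such a table, but check checkm lineage_wf -h to be sure. The function name list_incomplete and the example numbers are made up for illustration.

```shell
# list_incomplete CUTOFF TABLE
# Prints the 'Bin Id' of every genome whose 'Completeness' is below CUTOFF.
# Columns are located by header name, so their position does not matter.
list_incomplete() {
    awk -F '\t' -v cutoff="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Bin Id") id = i
                if ($i == "Completeness") comp = i
            }
            next
        }
        id && comp && ($comp + 0) < (cutoff + 0) { print $id }
    ' "$2"
}

# Made-up example table (not real CheckM output).
table=$(mktemp)
printf 'Bin Id\tCompleteness\tContamination\n'  > "$table"
printf 'genome_A\t98.50\t0.40\n'               >> "$table"
printf 'genome_B\t25.10\t0.00\n'               >> "$table"
printf 'genome_C\t69.90\t1.20\n'               >> "$table"

list_incomplete 70 "$table"
```

With a cutoff of 70, this prints genome_B and genome_C, which are the genomes you would then move into not_used/.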

Generate contigs databases

Now we finally get to start using anvi’o! In this step, you will convert each of your FASTA files into an anvi’o contigs database by running the command anvi-gen-contigs-database. For this and every other anvi’o command, you can find out more about the command and the parameters it requires by typing its name followed by -h or --help.

Depending on how many genomes you have, you may or may not find it tedious to run anvi-gen-contigs-database on each of them one by one. For the sake of convenience, I recommend that you run a ‘for loop’ to take care of all your genomes with a single command:

for f in *.fna; do anvi-gen-contigs-database --contigs-fasta $f --project-name THIOMONAS_GENOMES --output-db-path /home/mkh/Thiomonas_genomes/${f}_out.db; done

This will convert each of your genomes into an anvi’o contigs database one by one, but without the tediousness of running the same command 20+ times. Just type the for loop, hit enter, and sit back and let anvi’o work its magic.

Remember to enter the folder containing your genomes before running any commands on them by typing cd followed by the name of the folder (the forward slash at the end is optional):

cd Thiomonas_genomes/

Then you will be able to specify all of the files in that folder that end in .fna with ‘*.fna’. The ‘f’ in ‘for f in *.fna’ is a variable that holds the full name of one file at a time, including the ending ‘.fna’ (which is why the output files end up with names like ‘genome.fna_out.db’). The for loop will run the command ‘anvi-gen-contigs-database’ once for every file in the folder that ends in ‘.fna’.
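If you want to see for yourself what the loop variable holds, here is a small sketch you can run safely, since it only touches empty placeholder files in a throwaway folder. The function name show_loop_names and the file names are made up; ${f%.fna} is standard shell syntax for "the value of f with .fna removed from the end".

```shell
# show_loop_names DIR
# Loops over every .fna file in DIR and prints the loop variable as-is
# and with the '.fna' ending stripped off.
show_loop_names() {
    (
        cd "$1" || exit 1
        for f in *.fna; do
            echo "f=$f  without_extension=${f%.fna}"
        done
    )
}

# Throwaway folder with two empty placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/genome_1.fna" "$demo_dir/genome_2.fna"

show_loop_names "$demo_dir"
```

The first line printed is f=genome_1.fna  without_extension=genome_1. If you would rather not have ‘.fna’ in the middle of your database names, you could use ${f%.fna} in place of ${f} when building the output path.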

At this point, anvi’o may give you the following error:

Config Error: At least one of the deflines in your FASTA File does not comply with the ‘simple deflines’ requirement of anvi’o. You can either use the script `anvi-script-reformat-fasta` to take care of this issue, or read this section in the tutorial to understand the reason behind this requirement (anvi’o is very upset for making you do this): http://merenlab.org/2016/06/22/anvio-tutorial-v2/#take-a-look-at-your-fasta-file

This means that the headers in your FASTA files have spaces or illegal characters. The header is the identifier at the front of each sequence, like so:

A fasta file =

>header

ATCG…

>header2

ATTC…

>header3

GGGC…
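Before fixing anything, it can help to see how many of your headers are affected. This is a rough sketch that only counts headers containing spaces or tabs; it does not check for other illegal characters, and the function name count_spaced_headers and the example file are made up.

```shell
# count_spaced_headers FASTA
# Prints how many header lines (lines starting with '>') contain a space or tab.
count_spaced_headers() {
    awk '/^>/ && /[ \t]/ { n++ } END { print n + 0 }' "$1"
}

# Made-up example file: the second header contains spaces.
example=$(mktemp)
printf '>contig_1\nATCG\n>contig_2 some description here\nATTC\n' > "$example"

count_spaced_headers "$example"
```

For the example file this prints 1. To simply eyeball the first few headers of a real file, grep '^>' followed by the file name and piped into head does the job.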

You can fix this by running anvi-script-reformat-fasta:

for f in *.fna; do anvi-script-reformat-fasta --output-file ${f}_cleanedheaders --simplify-names $f; done

This will create a bunch of FASTA files that all end in ‘.fna_cleanedheaders’.

When you run anvi-gen-contigs-database again after fixing your headers, you will need to take this into account:

for f in *.fna_cleanedheaders; do anvi-gen-contigs-database --contigs-fasta $f --project-name THIOMONAS_GENOMES --output-db-path /home/mkh/Thiomonas_genomes/${f}_out.db; done

This will create a bunch of files in your genomes folder that end in ‘.fna_cleanedheaders_out.db’. These are your contigs databases.

Once you have your contigs databases, it’s a good idea to annotate them using anvi-run-ncbi-cogs and anvi-run-hmms, so that when you’re looking at your pangenome later, it actually holds some meaning and you’re not just looking at a bunch of sequences with no indication of what they actually do. You can do this with for loops:

for f in *.db; do anvi-run-ncbi-cogs --contigs-db $f --num-threads 12 --search-with blastp; done

for f in *.db; do anvi-run-hmms --contigs-db $f --num-threads 16; done

Make sure you run ‘anvi-setup-ncbi-cogs’ before you run the first of those. Just type ‘anvi-setup-ncbi-cogs’ into your terminal and you’ll be good to go.

A note on threads: the parameter --num-threads is very important, as the number of threads you choose will determine how long your command takes to run. Too few and your command will take the better part of a day; too many and it may crash because you’ve exceeded the capacity of your computer. Basically, each computer has a certain number of cores, and on many processors each core can run two threads.

I found 16 threads to be optimal, but it may be different for you. I was also using biomix, so I had access to more resources than just what my computer has. If you have a Mac, you can find out how many cores you have by clicking on the apple symbol in the upper left corner, then ‘About This Mac’, then ‘System Report’.
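If you would rather ask the terminal than click through menus, here is a minimal sketch. It relies on two things: getconf _NPROCESSORS_ONLN, which reports the number of logical CPUs (threads) on both Linux and macOS, and the SLURM_CPUS_PER_TASK variable, which SLURM sets inside a job when you request cores with -c. The function name cpu_threads is made up.

```shell
# cpu_threads
# Prints how many threads you can reasonably use:
# - inside a SLURM job started with -c, the number of CPUs you were given
# - otherwise, the number of logical CPUs on this machine
cpu_threads() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        getconf _NPROCESSORS_ONLN
    fi
}

echo "Threads available: $(cpu_threads)"
```

Whatever number this prints is a sensible upper limit for --num-threads; going lower just means the command takes longer.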

Generate a genomes storage

Now you’re going to take your contigs databases and put them in a genomes storage. This will require you to make a tab-delimited text file listing your contigs databases and their file paths.

The file path is basically just where your file is located, from broad to narrow. For example, if you have been putting everything on your desktop, the file path for one of your contigs databases might look like this:

/Users/michellehallenbeck/Desktop/Thiomonas_genomes/Thiomonas_sp_FB_6.fna_cleanedheaders_out.db

Or if you have been doing everything in biomix:

/home/mkh/Thiomonas_genomes/Thiomonas_sp_FB_6.fna_cleanedheaders_out.db

You can make the text file by first creating an Excel file with two columns: one with the name of your genome, and one with the file path of the corresponding contigs database. The first row should be a header containing the column names name and contigs_db_path. Then you just save it as a tab-delimited text file.
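If you have many genomes, typing all of those paths into Excel is a chore, so here is a sketch of how the same file could be built from the terminal. It assumes your contigs databases end in ‘.fna_cleanedheaders_out.db’ as in this tutorial, and that anvi’o wants genome names made up of letters, digits, and underscores only (that restriction is my understanding, so double-check it against the anvi’o documentation). The function name make_external_genomes is made up.

```shell
# make_external_genomes DIR
# Prints a two-column, tab-delimited table (name, contigs_db_path) for every
# contigs database in DIR whose name ends in '.fna_cleanedheaders_out.db'.
make_external_genomes() {
    printf 'name\tcontigs_db_path\n'
    for db in "$1"/*.fna_cleanedheaders_out.db; do
        [ -e "$db" ] || continue    # skip if the folder has no matches
        name=$(basename "$db" .fna_cleanedheaders_out.db)
        # Replace anything other than letters, digits and '_' with '_'.
        name=$(printf '%s' "$name" | tr -c 'A-Za-z0-9_' '_')
        printf '%s\t%s\n' "$name" "$db"
    done
}

# Demo on a throwaway folder containing one empty placeholder file.
demo_dir=$(mktemp -d)
touch "$demo_dir/Thiomonas_sp_FB_6.fna_cleanedheaders_out.db"
make_external_genomes "$demo_dir"
```

To write the table to a file instead of the screen, add > Thiomonas_contigs_databases.txt to the end of the last command and point it at your real genomes folder.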

If you have internal genomes, you will need to create a separate file for your internal genomes and their file paths, and when you run anvi-gen-genomes-storage you will need to add the flag --internal-genomes followed by the file path of your internal genomes text file.

Biomix people: you will need to copy your text file onto biomix:

scp /Users/michellehallenbeck/Desktop/Thiomonas_contigs_databases.txt mkh@biomix.dbi.udel.edu:/home/mkh/Thiomonas_genomes/

To generate the genomes storage:

anvi-gen-genomes-storage --external-genomes /home/mkh/Thiomonas_genomes/Thiomonas_contigs_databases.txt --output-file /home/mkh/Thiomonas_genomes/THIOMONAS-GENOMES.db

This will create a file ending in ‘-GENOMES.db’. If you run anvi-gen-genomes-storage -h, you will see that it tells you the output file (your genomes storage) has to end in ‘-GENOMES.db’.

Running the pangenome analysis

At long last, we arrive at the actual pangenome analysis. This will be accomplished with the command anvi-pan-genome:

anvi-pan-genome --genomes-storage THIOMONAS-GENOMES.db --project-name THIOMONAS --output-dir /home/mkh/Thiomonas_genomes/ --num-threads 16 --use-ncbi-blast --mcl-inflation 8

The workflow provided by the anvi’o developers actually takes the time to go through each of the parameters of this command, and you can also run anvi-pan-genome -h to learn more about them. At minimum, you will need to provide anvi’o with the name of your genomes storage (no file path this time), the name of the project (whatever you choose), and a file path for the output directory, so that anvi’o knows where to put the results of the pangenome analysis. A new folder containing the results will be created for you; the file path is only telling anvi’o where you want it to put that folder.

Depending on the specific question you are trying to answer, you may want to play around with the other parameters a bit. For example, the --mcl-inflation parameter affects the sensitivity of the program when it is defining gene clusters. The Meren lab recommends using a value of 2 (the default) when comparing distantly related genomes and 10 when comparing very closely related genomes. I chose to use a value of 8, but that was a judgement call on my part; you should use whatever value makes the most sense to you.

You can use either DIAMOND or NCBI blastp for protein search during the analysis. The default is DIAMOND in fast mode; obviously if you don’t have DIAMOND installed, you will need to either install it or use the flag --use-ncbi-blast when you run your pangenome analysis. If you are using DIAMOND, you can instruct it to be sensitive by using the flag --sensitive. It will take longer this way, but it will probably be more accurate.

The --num-threads parameter is also very important here. I used 16 threads when running my analysis, and it took between 2 and 3 hours. Again, you should check how many threads you can actually use before you start running anything.

I didn’t alter any of the other parameters when I ran my analysis, either because they didn’t apply to the question I was trying to answer or because I wasn’t sure what they meant. If you aren’t sure what one of the optional parameters does, my advice would be not to touch it. You can always run the analysis again with different parameters if you want to experiment or if something went wrong the first time.

Viewing your pangenome

Once you have run your pangenome analysis, you can look at your results with the command anvi-display-pan.

If you have been using biomix up until this point, you need to get off it and do the rest of this tutorial on your own computer. The anvi-display-pan command requires both the pangenome you just created and your genomes storage, so you need to copy both of them onto your own computer from biomix (log out of biomix before typing the following commands):

scp -r mkh@biomix.dbi.udel.edu:/home/mkh/Thiomonas_genomes/Thiomonas_pangenome/ /Users/michellehallenbeck/Desktop/

scp mkh@biomix.dbi.udel.edu:/home/mkh/Thiomonas_genomes/THIOMONAS-GENOMES.db /Users/michellehallenbeck/Desktop/

The first time you run the anvi-display-pan command, it will look like this:

anvi-display-pan --pan-db /Users/michellehallenbeck/Desktop/Thiomonas_pangenome/THIOMONAS-PAN.db --genomes-storage /Users/michellehallenbeck/Desktop/THIOMONAS-GENOMES.db --title ThiomonasPangenome

You will see something that looks like this:

This is not how my pangenome looked when I first displayed it; this is after I played around with the interface a bit and grouped my gene clusters into bins.

A ‘bin’ can be thought of as a box of gene clusters that can be found in a certain subset, or all, of your genomes. A gene cluster basically consists of all of the copies of a gene that are in a pangenome. You should group as many of your gene clusters as you can into bins; you will be grateful for this later when you summarize your pangenome.

You should play around with the interface and the different ways of displaying your results until you find an arrangement that works best for you. I found it best to go to the ‘Samples’ tab under the ‘Settings’ panel on the left, find ‘sample order’, and select ‘gene_cluster frequencies’:

This will organize your genomes based on the gene clustering results and make it easier to see which gene clusters should be grouped into bins. You can also change the colors of your bins and of your genomes in the ‘Settings’ panel.

When you are playing around with the interface, you can save the state your display is in at the present moment by going to the ‘Settings’ panel. You should be in the ‘Main’ tab, and there should be a ‘save state’ button:

You can overwrite your previous save, or give it a new name. Every time you display your pangenome thereafter, you can specify which state you want to load with the parameter --state-autoload followed by the name of the state.

If you want to inspect any of your gene clusters, right click on it and select ‘Inspect gene cluster’. You will see a bunch of amino acid sequences, in the same order as they appear in the display. To view function annotations, just click on one of the sequences and you will see its function annotation.

To make a bin, first create the bin by going to the ‘Settings’ panel and clicking on ‘Bins’:

You can create as many bins as you want and name them whatever you wish. Make sure you have selected the bin you want to add gene clusters to, and while this panel is open, zoom in on the tree in the middle of your display. When you move your mouse over the tree, you will see that you can make different selections from the tree, and a certain slice of your pangenome will be selected depending on where you are in the tree. Once your mouse is over the selection you want, click once and the gene clusters in your selection will be added to whichever bin you selected in the left panel.

Be very careful with this! If you inadvertently click on a gene cluster when you are trying to double click, you may accidentally add that gene cluster to a bin you didn’t intend to add it to. That’s why I recommend that every time you have made a bin that you actually want, you save it by clicking the ‘store collection’ button in the ‘Bins’ panel. This will save all of the bins you have at that point. You can overwrite your previous save, or change the name of your collection so that you have saved the different steps of your bin-making process.

When you subsequently view your pangenome again, you can specify which bins you want to see in your interface with the parameter --collection-autoload followed by the name of your desired bin collection.
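Putting the two autoload parameters together, reopening your pangenome where you left off might look something like the sketch below. It only prints the command so you can check it before running it yourself. The function name display_pan_command is made up, and ‘my_state’ and ‘my_bins’ are hypothetical names standing in for whatever you called your saved state and bin collection.

```shell
# display_pan_command PAN_DB GENOMES_DB STATE COLLECTION
# Prints (does not run) the anvi-display-pan command that reloads a saved
# state and a saved bin collection.
display_pan_command() {
    echo "anvi-display-pan --pan-db $1 --genomes-storage $2" \
         "--state-autoload $3 --collection-autoload $4"
}

# 'my_state' and 'my_bins' are hypothetical names; use the ones you saved.
display_pan_command \
    /Users/michellehallenbeck/Desktop/Thiomonas_pangenome/THIOMONAS-PAN.db \
    /Users/michellehallenbeck/Desktop/THIOMONAS-GENOMES.db \
    my_state my_bins
```

Copy the printed line into your terminal to actually open the display.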

Summarizing your pangenome

Once you are done binning your gene clusters and your pangenome looks as pretty as you can make it, you can summarize your results with the command anvi-summarize:

anvi-summarize --pan-or-profile-db /Users/michellehallenbeck/Desktop/Thiomonas_pangenome/THIOMONAS-PAN.db --genomes-storage /Users/michellehallenbeck/Desktop/THIOMONAS-GENOMES.db --collection-name default --output-dir /Users/michellehallenbeck/Desktop/PAN-SUMMARY

As with anvi-display-pan, you need to specify the file paths of your pangenome and your genomes storage. You also need to specify the folder in which you want anvi’o to place the summary files.

You must name this folder yourself!!! Do not, I repeat, DO NOT just specify the file path where you want anvi’o to place the new folder without naming the new folder itself and run the command, because it will overwrite (meaning delete) everything in the area you have specified. I made this mistake the first time I ran anvi-summarize, and anvi’o deleted all the bins that I had spent many painstaking hours making. Fortunately, I had backed up everything to an external hard drive not more than an hour before, so in the end I didn’t lose all my hard work. You’re welcome for the warning.
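One way to protect yourself from this mistake is to have the terminal confirm that your output folder does not exist yet before you run the command. This is just a sketch of that idea, not something built into anvi’o; the function name check_new_dir is made up.

```shell
# check_new_dir PATH
# Succeeds only if PATH does not exist yet, so it is safe to hand to a tool
# as a brand-new output folder. Prints a warning and fails otherwise.
check_new_dir() {
    if [ -e "$1" ]; then
        echo "Refusing to continue: '$1' already exists" >&2
        return 1
    fi
    return 0
}

OUT=/Users/michellehallenbeck/Desktop/PAN-SUMMARY
if check_new_dir "$OUT"; then
    echo "OK to run anvi-summarize with --output-dir $OUT"
fi
```

If you accidentally give it a folder that already exists (your Desktop, say), the check fails and you get a warning instead of a wiped folder.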

The output of the anvi-summarize command consists of a static html page that opens in Google Chrome and summarizes the details of your pangenome, and a text file listing your gene clusters, their genomes, their bins, and function annotations. I found it convenient to save the text file as an Excel file, for ease of reading and organizing the gene clusters by bin (using the filter function in Excel).
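If you would rather stay in the terminal than open Excel, a quick tally of gene clusters per bin can be done with awk. This sketch assumes the summary text file is tab-delimited with a header row containing columns named gene_cluster_id and bin_name (those column names are my assumption, so check the header of your own file, and unzip the file first if it ends in .gz). The function name clusters_per_bin and the example rows are made up.

```shell
# clusters_per_bin SUMMARY_TABLE
# Prints each bin name with the number of distinct gene clusters assigned
# to it. Rows with an empty bin name (unbinned gene clusters) are skipped.
clusters_per_bin() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "bin_name") bin = i
                if ($i == "gene_cluster_id") gc = i
            }
            next
        }
        bin && gc && $bin != "" && !seen[$bin SUBSEP $gc]++ { count[$bin]++ }
        END { for (b in count) print b "\t" count[b] }
    ' "$1" | sort
}

# Made-up example rows (one row per gene, so a cluster can appear twice).
summary=$(mktemp)
printf 'gene_cluster_id\tbin_name\tgenome_name\n'   > "$summary"
printf 'GC_00000001\tCore\tgenome_A\n'             >> "$summary"
printf 'GC_00000001\tCore\tgenome_B\n'             >> "$summary"
printf 'GC_00000002\tCore\tgenome_A\n'             >> "$summary"
printf 'GC_00000003\tAccessory\tgenome_B\n'        >> "$summary"
printf 'GC_00000004\t\tgenome_A\n'                 >> "$summary"

clusters_per_bin "$summary"
```

For the example rows this reports 1 gene cluster in Accessory and 2 in Core.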

Conclusion

I hope you have had fun on this little adventure. If you have any questions, comments, or concerns regarding any part of this tutorial, please feel free to leave a comment!
