Playing with neural networks – new scientific names for bacteria

It was recently Valentine’s Day, and I came across the story of an AI that had come up with some innovative messages to print on candy hearts. Over Christmas, my sister (who never fails to bring joy into my life) had shown me a new Harry Potter story that had been written by an AI, and I was extremely taken with this new source of mirth in my life. Needless to say, it wasn’t long before I wondered whether I could put neural networks to my own uses.

Via an aiweirdness post about recipes, I found the open-source neural network code that Janelle Shane used, and I thought I should give it a try. After some struggle learning how to navigate in Mac’s terminal and figuring out all the things I had to install to get the model to run – I did it. I ran a neural network! My fiancé learned to code his own neural networks a few months ago just for fun, so I was expecting it to be a much more involved process. But karpathy did the hard work of writing the neural network code; I just had to run it.

Oh, and curate my chosen dataset.

A list of the scientific names of all known bacteria.

In the end, I used this list of known species of microbe curated by the Swiss Institute of Bioinformatics, the European Bioinformatics Institute, and the Protein Information Resource in the US. I kept only the entries marked ‘B’ for bacteria and eliminated all subspecies/serotype/serovar/etc information (it was difficult to handle elegantly). I removed all redundant entries and gave it to the neural network as a .txt file.
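
For anyone who wants to try the same clean-up, here is a minimal Python sketch of the filtering described above. It is not the process actually used for this post. The line layout (a code, a one-letter kingdom flag, a taxon number, then "N=" and the name) and the sample entries are assumptions for illustration, so check them against the real file first.

```python
import re

# Assumed line layout (not confirmed by the post): a code, a one-letter
# kingdom flag, a taxon number, then "N=" followed by the scientific name.
ENTRY = re.compile(r"^\S+\s+([A-Z])\s+\d+:\s+N=(.+)$")

def bacterial_binomials(lines):
    """Return unique 'Genus species' names from entries flagged 'B'."""
    seen = set()
    names = []
    for line in lines:
        match = ENTRY.match(line.rstrip())
        if not match or match.group(1) != "B":
            continue  # not an entry line, or not a bacterium
        words = match.group(2).split()
        if len(words) < 2:
            continue
        # Keep the first two words only, dropping subspecies/serovar/strain details.
        name = " ".join(words[:2])
        if name not in seen:  # skip redundant entries
            seen.add(name)
            names.append(name)
    return names

# Hypothetical sample entries in the assumed layout.
sample = [
    "ECOLI B 83333: N=Escherichia coli (strain K12)",
    "ECO57 B 83334: N=Escherichia coli O157:H7",
    "HUMAN E 9606: N=Homo sapiens",
    "MYXXA B 34: N=Myxococcus xanthus",
]
print("\n".join(bacterial_binomials(sample)))
```

Truncating to two words is a blunt way to drop the strain and serovar details, but it also keeps placeholder names such as "sp.", which matches what shows up in the network's output.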

The first problem that I had was that the output of the network showed only certain genus names repeating over and over:

 \orynebacterium thermofolmans
 \orynebacterium thermorilistranicum
 \orynebacterium sp.
 \orynebacterium sp.
 \orynebacterium hymofrevicum
 \orynebacterium glutamicum
 \orynebacterium emmonis
 \orynebacterium pseudotuberculosis
 \orynebacterium emoraosium
 \yxobacterium tuberculosis
 \orynebacterium trephomyces

I figured that this was because the source dataset was alphabetized by genus. I randomized it and tried again.
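
Randomizing the order takes only a few lines of Python. This is a sketch of one way to do it, not necessarily how it was done here; the function name and the `seed` parameter are made up for the example.

```python
import random

def shuffle_lines(names, seed=None):
    """Return a shuffled copy of the names, leaving the input list untouched."""
    shuffled = list(names)
    # A seeded Random instance makes the shuffle repeatable when seed is given.
    random.Random(seed).shuffle(shuffled)
    return shuffled

alphabetized = [
    "Brucella abortus",
    "Corynebacterium glutamicum",
    "Legionella pneumophila",
    "Myxococcus xanthus",
]
mixed = shuffle_lines(alphabetized, seed=1)
print("\n".join(mixed))
```

Every name is still present exactly once after the shuffle; only the order changes, so the network no longer sees long runs of the same genus.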

The next problem (which is also noticeable above) was that there were unexpected “\” characters. I thought at first that this was the program’s way of saying that this should all be the same letter – in this case, “C” – but with some assistance from the fiancé I learned that my source file had hidden characters that Mac’s TextEdit program wasn’t able to reveal. By opening the file in Atom instead, I was able to remove the backslashes and some header text.
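
A quick script can also flag stray characters before training. The sketch below assumes that a clean line contains only letters, spaces, periods, and hyphens. That rule and the sample text are assumptions for illustration, not a description of what was actually in the file.

```python
def suspicious_lines(text, allowed_extra=" .-"):
    """Return (line_number, repr_of_line) for lines with unexpected characters."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Anything that is not a letter or an explicitly allowed character is suspect.
        if any(not (ch.isalpha() or ch in allowed_extra) for ch in line):
            flagged.append((number, repr(line)))
    return flagged

# Hypothetical sample: a trailing backslash on line 1 and a header-like line 3.
sample = "Corynebacterium glutamicum\\\nMyxococcus xanthus\n{header}\n"
for number, shown in suspicious_lines(sample):
    print(number, shown)
```

Printing each flagged line with `repr` makes backslashes and other normally invisible characters show up explicitly.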

This time it worked. After going through the dataset twice, the network hadn’t quite figured out what was going on:

 ruc iub
 neitaevp ilmaBcaspei ebosevicrorauo
 vaacoumimoniilfmtxuo co
 uilirsaotaiulltntilnntsnaviris
 mtimtepoturnonuim sicaotansteo
 gclgia
 misgibgaeia
 rRrroryrceoaisobhrlmucia  ctrilyscndte
 sofitilibicis iilitialicolun
 Fiacrissfdllo
   dealie snucoopel

But after 15 times through the dataset it was getting closer, giving me such organisms as

 Levionella nokheae
 Staphylococcus thermocheonosulforum
 Pranktomonas destothilolis
 Brucella centus
 Hallicellulosilum teprolicum
 Naelochrorgoiphilus salobacillus
 Tseudomonas cynnomonas
 Lactobacillus chiauldarus

Finally, after 50 times through the dataset, it gave some pretty plausible-sounding names. It doesn’t understand that it can mix and match Latin words and roots, but it manages to create a Latinate sound:

 Clostridium lenitoremoens
 Sulfobacillus amyloletic
 Pelobacter protiosum
 Chlarydiphala bifica
 Legionella acidophila
 Enterobacter subterraneus
 Methylobacter marinus
 Rickettsia rickettsii

On the other hand, it gave some names that I don’t think will appear in the ATCC any time soon:

 Gluconacetobacter dinzettii
 Pdevotella mutgli
 Sphingobium jyanniculatiens
 Maginococcus walenii
 Kipnella avinovirgdis
 Klubacellulfa streptococcus mutans
 Bradyrhizobium eldenii
 Erwinia rubinae
 Acidaminococcus glomicola

This was a lot of fun, and I fully intend to keep playing around with this neural network and feeding it interesting biology-related datasets. In fact, as I write this, I am training a network to write the title and abstract of a supposed paper about Myxococcus xanthus, my lovely study organism.

And hey! If we ever run out of informative Latin names for microbes, I can help produce some less informative but perhaps more fun suggestions the same way that Janelle Shane helped to name rescued guinea pigs.
