It was recently Valentine’s Day, and I came across the story of an AI that had come up with some innovative messages to print on candy hearts. Over Christmas, my sister (who never fails to bring joy into my life) had shown me a new Harry Potter story that had been written by an AI, and I was extremely taken with this new source of mirth. Needless to say, it wasn’t long before I wondered whether I could put neural networks to my own uses.
Via an aiweirdness post about recipes, I found the open-source neural network code that Janelle Shane used, and I thought I should give it a try. After some struggle learning how to navigate the Mac terminal and figuring out everything I had to install to get the model to run – I did it. I ran a neural network! My fiancé learned to code his own neural networks a few months ago just for fun, so I was expecting it to be a much more involved process. But karpathy did the hard work of writing the neural network code; I just had to run it.
Oh, and curate my chosen dataset.
A list of the scientific names of all known bacteria.
In the end, I used this list of known species of microbe curated by the Swiss Institute of Bioinformatics, the European Bioinformatics Institute, and the Protein Information Resource in the US. I kept only the entries marked ‘B’ for bacteria and eliminated all subspecies/serotype/serovar/etc information (it was difficult to handle elegantly). I removed all redundant entries and gave it to the neural network as a .txt file.
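The cleanup described above could be sketched in Python roughly as follows. This is a minimal illustration, not the exact steps used for this post: the `(kingdom_code, name)` input pairs, the function names, and the list of rank markers are all assumptions, and parsing the real species list into such pairs is left out.

```python
# Hypothetical markers that introduce below-species detail (an assumption,
# not an exhaustive list).
RANK_MARKERS = {"subsp.", "serovar", "serotype", "biovar", "pv.", "str.", "strain"}


def to_binomial(name):
    """Cut the name at the first rank marker and keep at most two words."""
    words = name.replace("(", " ").replace(")", " ").split()
    kept = []
    for word in words:
        if word in RANK_MARKERS:
            break
        kept.append(word)
    return " ".join(kept[:2])


def clean_names(entries):
    """Return unique bacterial binomials from (kingdom_code, name) pairs.

    Only entries whose code is 'B' are kept; the first-seen order is preserved.
    """
    seen = set()
    result = []
    for kingdom, name in entries:
        if kingdom != "B":
            continue
        binomial = to_binomial(name)
        if binomial and binomial not in seen:
            seen.add(binomial)
            result.append(binomial)
    return result
```

Writing the result out one name per line, for example with `"\n".join(clean_names(entries))`, would give a .txt file of the kind described above.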
The first problem that I had was that the output of the network showed only certain genus names repeating over and over:
\orynebacterium thermofolmans
\orynebacterium thermorilistranicum
\orynebacterium sp.
\orynebacterium sp.
\orynebacterium hymofrevicum
\orynebacterium glutamicum
\orynebacterium emmonis
\orynebacterium pseudotuberculosis
\orynebacterium emoraosium
\yxobacterium tuberculosis
\orynebacterium trephomyces
I figured that this was because the source dataset was alphabetized by genus. I randomized it and tried again.
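Randomizing the order of the names is a small job in Python. The sketch below is one possible way to do it, and the function name and `seed` parameter are illustrative assumptions rather than what was actually used here.

```python
import random


def shuffle_names(names, seed=None):
    """Return a shuffled copy of the names, leaving the input list untouched."""
    rng = random.Random(seed)  # pass a seed for a repeatable order
    shuffled = list(names)
    rng.shuffle(shuffled)
    return shuffled
```

Shuffling a copy keeps the alphabetized list around in case it is needed again.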
The next problem (which is also noticeable above) was that there were unexpected “\” characters. I thought at first that this was the program’s way of saying that these should all be the same letter – in this case, “C” – but with some assistance from the fiancé I learned that my source file had hidden characters that Mac’s TextEdit program wasn’t able to reveal. After downloading Atom, I was able to see and remove the backslashes and some header text.
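A quick way to spot stray characters like these without relying on a text editor is to count everything in the file that isn't expected in a species name. This is only a sketch: the set of expected characters and the function name are assumptions, and a real file may legitimately contain others.

```python
from collections import Counter
import string

# Characters assumed to be legitimate in a list of names, one per line.
EXPECTED = set(string.ascii_letters + " .-\n")


def unexpected_characters(text):
    """Count every character in the text that falls outside the EXPECTED set."""
    return Counter(ch for ch in text if ch not in EXPECTED)
```

Printing the result for the contents of the training file shows each odd character alongside how often it occurs, including ones that are invisible in an editor.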
This time it worked. After going through the dataset twice, the network hadn’t quite figured out what was going on:
ruc iub neitaevp ilmaBcaspei ebosevicrorauo vaacoumimoniilfmtxuo co uilirsaotaiulltntilnntsnaviris mtimtepoturnonuim sicaotansteo gclgia misgibgaeia rRrroryrceoaisobhrlmucia ctrilyscndte sofitilibicis iilitialicolun Fiacrissfdllo dealie snucoopel
But after 15 times through the dataset it was getting closer, giving me such organisms as:
Levionella nokheae
Staphylococcus thermocheonosulforum
Pranktomonas destothilolis
Brucella centus
Hallicellulosilum teprolicum
Naelochrorgoiphilus salobacillus
Tseudomonas cynnomonas
Lactobacillus chiauldarus
Finally, after 50 times through the dataset, it gave some pretty plausible-sounding names. It doesn’t understand that it can mix and match Latin words and roots, but it manages to create a Latinate sound:
Clostridium lenitoremoens
Sulfobacillus amyloletic
Pelobacter protiosum
Chlarydiphala bifica
Legionella acidophila
Enterobacter subterraneus
Methylobacter marinus
Rickettsia rickettsii
On the other hand, it gave some names that I don’t think will appear in the ATCC any time soon:
Gluconacetobacter dinzettii
Pdevotella mutgli
Sphingobium jyanniculatiens
Maginococcus walenii
Kipnella avinovirgdis
Klubacellulfa streptococcus mutans
Bradyrhizobium eldenii
Erwinia rubinae
Acidaminococcus glomicola
And hey! If we ever run out of informative Latin names for microbes, I can help produce some less informative but perhaps more fun suggestions the same way that Janelle Shane helped to name rescued guinea pigs.