Recently I found a very neat program called Capti Narrator, which does a pretty good job of reading texts aloud. It works on an iPhone and it’s free. Great, I thought: perhaps I would finally be able to catch up on some of all those articles I want to read.
Well, it turns out Capti isn’t great for full-blown journal articles. But with a little tweaking I do find it excellent for all the extra stuff journals are filled with, like the perspectives and news content. Here I’ll share the script (code here) I made for improving the listening experience and for automatically scraping all the “news and views” articles from a Nature journal. The end result sounds like this:
I had a number of issues with Capti “out of the box” for scientific texts:
- It tends to read the text in one big blob. As a listener it’s hard to figure out where the paragraphs are, or whether something is actually a title or a figure legend.
- As Capti is reading, it will read all references out loud, as well as URLs and other stuff you basically don’t care about.
- For these reasons, for a good “listen”, you often need to copy-paste the text into a text file, check that all paragraphs are neatly lined up, and then combine articles by hand.
Now, I thought all these things were a little annoying, so I made an R script which basically does the following:
- It scrapes all the “news and views” and “research highlight” articles from the current issue of a Nature journal, in my case Nature Immunology, and combines them in one text file.
- It puts a distinguishing text in front of the title, author, abstract, date and each paragraph. This way, when a text is read aloud, the program will read “Title: ‘Central tolerance: what you see is what you don’t get!’”, say “New paragraph” before each new subsection, and so on. This makes it a little easier to listen to the text and follow along.
- It scrapes all the paragraphs from the articles and removes a number of things: all links, all the weird line breaks and all the references. It also removes the figure legends.
- Finally it saves it all in a text file in a folder of your choice. Since Capti can sync with Dropbox, I just save it there and voila: an hour’s worth of listening is ready.
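To make the labelling step a bit more concrete, here is a minimal sketch of the idea. It is written in Python rather than R and is not the code from my script. The function names (`squish`, `format_article`) and the exact label wording are made up for illustration, and it assumes the title, author, date and paragraphs have already been scraped.

```python
import re


def squish(text):
    """Collapse runs of whitespace (including stray line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def format_article(title, author, date, paragraphs):
    """Put a spoken label in front of each part of one article.

    The labels are illustrative; any wording the text-to-speech voice
    reads clearly would work just as well.
    """
    lines = [
        "Title: " + squish(title),
        "Author: " + squish(author),
        "Date: " + squish(date),
    ]
    for paragraph in paragraphs:
        cleaned = squish(paragraph)
        if cleaned:  # skip paragraphs that were only whitespace
            lines.append("New paragraph. " + cleaned)
    # Blank lines between the parts keep them apart in the text file.
    return "\n\n".join(lines)
```

Joining the output of `format_article` for every article in an issue and writing the result to a text file in the synced folder would then cover the last step in the list.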
The voice you listened to above is called “Joey” (which cost 4 dollars). Given the number of weird scientific words in your standard journal article, I think it does a pretty good job.
I’ll write a detailed post about the code later, but here are a few notes:
- There are plenty of things that could be done to automate this further, like going through a list of all your favorite journals, always finding the things you haven’t listened to, etc. For now, it’ll just give you one journal at a time.
- This only works with Nature journals.
- I used the rvest package for most of the scraping, but reverted to regular expressions for the text itself. I know, I know! You aren’t supposed to do that. But it turned out to be a lot easier to remove all the references and links this way, since they are in distinct html tags.
- For the text manipulation I used another Hadley Wickham package, stringr, along with an awesome free book from Gaston Sanchez’s website called “Handling and Processing Strings in R”.
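As a rough illustration of the regex approach mentioned above, here is a small sketch, again in Python rather than the R/stringr of the actual script. The tag names are assumptions made for the example: it pretends that references sit in `<sup>` tags and links in `<a>` tags, which may not match the real markup of a Nature page. `clean_paragraph` is a hypothetical helper name.

```python
import re

# Assumed markup (for illustration only): references in <sup>...</sup>,
# links in <a ...>...</a>. Real pages may use different tags or classes.
REFERENCE_TAG = re.compile(r"<sup\b[^>]*>.*?</sup>", re.DOTALL | re.IGNORECASE)
LINK_TAG = re.compile(r"<a\b[^>]*>.*?</a>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def clean_paragraph(html):
    """Strip references, links and leftover tags from one paragraph of HTML."""
    text = REFERENCE_TAG.sub("", html)   # drop reference markers with their content
    text = LINK_TAG.sub("", text)        # drop links, including the visible URL text
    text = ANY_TAG.sub("", text)         # remove remaining tags, keep their text
    text = re.sub(r"\s+", " ", text).strip()    # collapse odd line breaks
    text = re.sub(r"\s+([.,;:])", r"\1", text)  # no space before punctuation
    return text


example = (
    '<p>T cells are deleted in the thymus<sup><a href="#ref1">1</a></sup>.\n'
    ' See <a href="http://example.com">http://example.com</a> for details.</p>'
)
print(clean_paragraph(example))
# prints: T cells are deleted in the thymus. See for details.
```

The usual warning applies: regular expressions like these break easily on nested or unusual HTML, so this only works when the unwanted parts really do sit in their own, predictable tags.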
The code can be found here.