An R script for scraping news from Nature journals to Capti Narrator

Recently I found a very neat program called Capti Narrator, which does a pretty good job of reading texts aloud. It works on an iPhone and it’s free. Well, this is great, I thought – perhaps I would finally be able to catch up on some of those articles I want to read?

Well, it turns out Capti isn’t great for full-blown journal articles. But with a little tweaking, I do find it excellent for all the extra stuff journals are filled with, like the perspectives and news content. Here I’ll share the script (code here) I made for improving the listening experience and for automatically scraping all the “news and views” articles from a Nature journal. The end result sounds like this:

I had a number of issues with Capti “out of the box” for scientific texts:

  • It tends to read the text in one big blob – as a listener it’s hard to figure out where the paragraphs are, or whether something is actually a title or a figure legend.
  • As Capti is reading, it will read all references out loud, as well as URLs and other stuff you basically don’t care about.
  • For these reasons, for a good “listen”, you often need to copy-paste the text into a text file, check that all paragraphs are neatly lined up, and then combine articles by hand.

Now, I thought all these things were a little annoying, so I made an R script which basically does the following:

  1. It scrapes all the “news and views” and “research highlight” articles from the current issue of a Nature journal (in my case Nature Immunology) and combines them in one text file.
  2. It puts a distinguishing label in front of the title, author, abstract, date and each paragraph. This way, when the text is read aloud, the program will read “Title: ‘Central tolerance: what you see is what you don’t get!’”, say “New paragraph” before each new subsection, and so on. It makes it a little easier to listen to the text and follow along.
  3. It scrapes all the paragraphs from the articles and removes a number of things: all links, all the weird line breaks and all the references. It also removes the figure legends.
  4. Finally, it saves it all in a text file in a folder of your choice: since Capti can sync with Dropbox, I just save it there and voilà – an hour’s worth of listening is ready.
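To give an idea of what step 2 and step 4 look like in practice, here is a minimal sketch in base R. It is not the actual script: the helper name format_article, the placeholder author and date, and the output file name are all made up for illustration, and the real code may label things differently.

```r
# Hypothetical helper (not from the original script): put a spoken label in
# front of each part of an article so the reader announces what comes next.
format_article <- function(title, author, date, paragraphs) {
  c(
    paste("Title:", title),
    paste("Author:", author),
    paste("Date:", date),
    # paste() is vectorised, so every paragraph gets its own label
    paste("New paragraph.", paragraphs)
  )
}

article <- format_article(
  title = "Central tolerance: what you see is what you don't get!",
  author = "Example Author",   # placeholder, not a real byline
  date = "Example date",       # placeholder
  paragraphs = c("First paragraph of the text.",
                 "Second paragraph of the text.")
)

# A temporary folder is used here; in practice this would be whatever
# Dropbox folder Capti syncs with.
out_file <- file.path(tempdir(), "nature_news.txt")
writeLines(article, out_file)
```

Each element ends up on its own line in the text file, so the paragraphs stay neatly separated when Capti reads them.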

The voice you listened to above is called “Joey” (which cost 4 dollars). Given the number of weird scientific words in your standard journal article, I think it does a pretty good job.

I’ll write a detailed post about the code later, but here are a few notes:

  1. There are plenty of things that could be done to further automate this, like going through a list of all your favorite journals, always finding things you haven’t listened to, etc. – but for now, it’ll just give you one journal at a time.
  2. This only works with Nature journals.
  3. I used the rvest package for most of the scraping, but reverted to regular expressions for the text itself – I know, I know! – you aren’t supposed to do that. But it turned out it was a lot easier to remove all the references and links this way, since they are in distinct HTML tags.
  4. For the text manipulation I used another Hadley Wickham package, stringr, along with an awesome free book from Gaston Sanchez’s website called “Handling and Processing Strings in R”.
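Until the detailed post is written, here is a rough sketch of the regex idea from note 3, using base R only. The tag names are assumptions for illustration (references inside sup tags, links inside a tags), and clean_paragraph is a made-up helper name; the real pages and the real script may well differ. With stringr, str_replace_all and str_squish would do the same jobs as gsub and the whitespace clean-up below.

```r
# Hypothetical helper: turn the raw HTML of one paragraph into plain text
# that is pleasant to listen to.
clean_paragraph <- function(html) {
  # Drop reference markers, ASSUMED here to sit inside <sup> tags.
  # (?s) lets "." match line breaks; ".*?" stops at the first closing tag.
  txt <- gsub("(?s)<sup\\b[^>]*>.*?</sup>", "", html, perl = TRUE)
  # Drop links together with their link text, ASSUMED to be <a> tags.
  txt <- gsub("(?s)<a\\b[^>]*>.*?</a>", "", txt, perl = TRUE)
  # Strip any remaining tags but keep the text inside them.
  txt <- gsub("<[^>]+>", "", txt)
  # Collapse line breaks and runs of whitespace into single spaces.
  txt <- gsub("\\s+", " ", txt, perl = TRUE)
  trimws(txt)
}

# Made-up input, just to show the effect.
raw <- paste0(
  "<p>T cells are selected in the\nthymus",
  "<sup><a href=\"#ref1\">1</a></sup>   and then exported.",
  "<a href=\"http://example.com\">Full text</a></p>"
)

cleaned <- clean_paragraph(raw)
print(cleaned)
```

The order matters a bit: references and links are removed while their tags are still there to find, and only afterwards are the remaining tags stripped.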

The code can be found here.
