We at ContentMine are a non-profit NGO from Cambridge UK, who are practitioners on the forefront of text data mining - the free and open way. Here we summarize our insights and how you can to TDM in practice.
Text data mining is a wide field, and we focus on scholarly literature. Most of of our sources are Open Access, a precondition for unimpeded mining, , and we develop open source software to do so.
TDM in practice
But why do text data mining, or content mining as we call it, with scientific publications? A central purpose is to explore massive numbers of publications and find out about whole research fields: how they evolved, who is important, how the use of language changed, which instruments or methods are used, or what drug, gene, species or spatial entity is mentioned. Often you also want to find patterns of entity occurrences or get statistical aggregates of them. To be more concrete: It is not difficult to count all genes named in one publication, but what if you want to count them in all medical studies on a specific virus after the outbreak to get a better understanding of it? There are maybe thousands or even more pages to read, nothing that can be done quickly and without massive resources used. So here content mining and open access helps immensely, especially when it is urgent and of high importance.
We at ContentMine applied open methods for years, and use them also for content mining. Our main repository is Europe PMC (Open Access) with over 4 million publications in fulltext available under open licences. When you want to start with our software, there are two steps to do:
There are no real requirements to get your hands on our software or to start withcontent mining, but a good understanding of computers and some experience in any programming language can help you with your first steps and allow you to understand some of the magic behind it. Our resources will guide you through the installation as well as help you with your first content mining activities.
How you can use our contribution to FutureTDM to explore TDM
We brought our practical expertise into the FutureTDM project. The main outcome are three tutorials on specific content mining use-cases.
The zika virus is a global pandemic with not too much research done before its latest outbreak. The tutorial shows the power of content mining for an urgent and important event such as the start of a global pandemic. Special importance in our tutorial lies on extracting information about species and to get a general understanding of the development of the research field itself.
Then we used the zika corpus for a systematic literature review (SLR). SLR’s try to bring some sort of order into the messy scientific publishing world. The tutorial shows how to filter out non-relevant publications and find the ones you are looking for.
The third one focuses on statisticians. P-hacking is all around, and many are thinking of even banning the useage of the p-value completely in their research. But why not explore the statistical measures yourself and extract them for further analysis? Here we help you to find them in a corpus through applying regular expressions.
All of the tutorials as all other materials we created can be found in our project repository on GitHub. Have a look at it, we are also happy about feedback.
Other things we did
Besides the tutorial, we also advocated for free and open access to scientific knowledge and a text data mining exception in the EU copyright reform. In our participation in the Brussels workshop II we made our case and gave examples for the barriers we experience as daily practitioners.
We also trained researchers in our software at the ELPUB 2017 conference, where we organized a workshop together with Open Knowledge. In an awareness sheet, we summarized what we do and the barriers we face. Finally, we presented our work at the International Data Science Conference 2017 in Salzburg and received interesting feedback from the text data mining community - from lawyers to coders.
Moving forward
The FutureTDM project was an important step for us to understand the needs of others in the field and it was a great opportunity to share our open approach. We want to thank all our partners, all participants and the team for supporting us in our work to open up and enable text data mining for all. All our outcomes are openly available in our GitHub repository.
“The Right to Read is the Right to Mine”
Header image: Map global aedis aegypti distribution by Moritz Kraemer et al, CC0