Is there a better way to write your academic papers? New research says yes
By Lucy Goodchild van Hilten
To measure the impact of style in academic writing, an information systems professor at Arizona State University is using Elsevier data, NLP and machine learning.
Academic writing follows certain style conventions: it is expected to be impersonal, factual and clear, and to use the past tense and the passive voice. But does all this result in a better academic paper? If one of the measures of success is recognition by one’s peers via citations, the answer might be no.
Dr Reihane Boghrati, an Assistant Professor of Information Systems in the WP Carey School of Business at Arizona State University, has been using Elsevier data to explore this question. Her results suggest it may be time to rethink some of our assumptions. When it comes to academic articles, how you write might be just as interesting as what you write.
With a PhD in Computer Science from the University of Southern California, Dr Boghrati enjoys using cutting-edge methods to study language and explore things like the impact of writing style in academia. She studies psychological phenomena expressed in written and spoken language using natural language processing and machine learning methods.
Having access to data through Elsevier’s International Center for the Study of Research (ICSR) meant Dr Boghrati was able to look at why certain things catch on and others don’t, as she explained:
There’s a debate about why things succeed in the marketplace of ideas, and of course quality matters a lot. But I would suggest that communication style also plays an important role. In academic research, people may think writing is just a way to communicate the truth, but we wanted to find out what impact writing style has beyond this.
Working with colleagues Dr Jonah Berger, a professor at the Wharton School at the University of Pennsylvania, and Dr Grant Packard, Associate Professor of Marketing at the Schulich School of Business at York University in Canada, Dr Boghrati compiled a corpus of about 75,000 peer-reviewed articles from 49 journals, published between 1990 and 2018. Of the 75,000 articles, 40,000 were full-text articles supplied by ScienceDirect and the remainder were from a humanities source. They acquired the full text of the articles, as well as information like the title, issue, and authors, to approach the question from various angles.
Quantifying the impact of writing style
The first challenge Dr Boghrati and her colleagues faced was determining how to define writing style and link it to an outcome. Writing style is the way an author communicates an idea. Different writing styles vary in their use of certain types of words, complexity and perspective. Dr Boghrati shared the following example:
I enjoyed Jasper Grille.
The Jasper Grille was really enjoyable.
While these sentences convey the same meaning, they do so using different writing styles. Prior research has shown that a small class of words called function words reflect writing style regardless of content, making them an ideal measure of style in academic writing. Function words include conjunctions, auxiliary verbs, and prepositions: words like “and,” “on” and “the.” In the example, one sentence uses a personal pronoun, and the other uses an article and a common adverb.
These function words make up a tiny portion of the human vocabulary — only about 0.04 percent in English — but we use them in every sentence to bind the nouns, verbs and adjectives that make up the meaningful content. “Function words don't receive much attention and are mostly treated as junk,” Dr Boghrati said. “But they are specifically valuable here because they capture style rather than content.”
In this study, Dr Boghrati and her colleagues measured how often function words appear in articles to see whether their use correlates with citations. They counted the words in each of nine function word categories: auxiliary verbs, conjunctions, negations, grammatical articles, prepositions, personal pronouns, impersonal pronouns, qualifiers and common adverbs. To do this, they used Linguistic Inquiry and Word Count (LIWC), a tool for the social sciences that has word lists for different categories and calculates the incidence rate of a category in a given piece of text.
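The kind of count described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the LIWC tool itself: the word lists below are tiny, hypothetical stand-ins for LIWC's proprietary dictionaries, and the function name `category_rates` is invented for this example.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated word lists for illustration only.
# The real LIWC dictionaries are proprietary and far larger.
FUNCTION_WORDS = {
    "article": {"a", "an", "the"},
    "preposition": {"in", "on", "of", "to", "with", "for"},
    "personal_pronoun": {"i", "we", "you", "he", "she", "they"},
    "conjunction": {"and", "but", "or"},
    "auxiliary_verb": {"is", "was", "are", "were", "be", "do", "will"},
    "common_adverb": {"really", "very", "just"},
}


def category_rates(text):
    """Return each category's share of all word tokens, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {category: 0.0 for category in FUNCTION_WORDS}
    counts = Counter()
    for token in tokens:
        for category, words in FUNCTION_WORDS.items():
            if token in words:
                counts[category] += 1
    return {c: 100.0 * counts[c] / len(tokens) for c in FUNCTION_WORDS}


# The two example sentences from the article.
first = category_rates("I enjoyed Jasper Grille")
second = category_rates("The Jasper Grille was really enjoyable.")
```

With these toy lists, the first sentence registers only a personal pronoun, while the second registers an article, an auxiliary verb and a common adverb, which is the stylistic difference the example is meant to show.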
Style matters beyond control factors
What drives the citations of a given publication is not simple to analyze because many factors feed into the count. For example, an article published in a journal with a high CiteScore is more likely to have a high citation count than an article published in a journal with a lower CiteScore. Articles published a long time ago are also likely to have more citations than recently published articles.
It was important to remove the impact of any factors other than style from the analysis. Dr Boghrati looked at previous research to determine what could affect citation count. Structural aspects of the article could make a difference, including the length of the article, title and abstract, the number of references and the order in which the article appears in the journal. Factors like the author’s prominence, institution and gender and the number of co-authors also affect citation count. And content-related factors, including the topics covered and whether the article is theoretical or more empirically based, make a difference.
They controlled for all these factors in the analysis. To control for topic, they used a topic modeling method called Latent Dirichlet Allocation (LDA). Rather than assuming that each article is about only one topic, it allows each article to be represented as a mixture of different topics. LDA calculates the probability of different topics occurring in an article based on the text; Dr Boghrati included each article’s topic probabilities as a control.
With these controls in place, Dr Boghrati used negative binomial regression to analyze the link between style and citation count. “Our results suggest that above and beyond any variance that is explained by our control features, style helps explain how many citations an article receives,” she said.
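The article does not give the model specification, so the following is only a sketch of the general technique. Negative binomial regression is a common choice for counts such as citations because it lets the variance exceed the mean. The parameterisation used here (often called NB2, with variance equal to mu + alpha * mu squared), the log link, and all function names are assumptions for illustration.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y under a negative binomial (NB2) model.

    mu is the expected count and alpha > 0 the overdispersion, so that
    variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    p = r / (r + mu)
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(p) + y * math.log(1.0 - p))


def expected_citations(features, coefficients, intercept):
    """Log link: expected count = exp(intercept + sum of coef * feature)."""
    linear = intercept + sum(c * x for c, x in zip(coefficients, features))
    return math.exp(linear)


def log_likelihood(counts, rows, coefficients, intercept, alpha):
    """Total log-likelihood of observed counts given one feature row each."""
    return sum(
        nb_log_pmf(y, expected_citations(x, coefficients, intercept), alpha)
        for y, x in zip(counts, rows)
    )
```

Fitting the model means choosing the coefficients and alpha that maximise `log_likelihood`. In a setup like the one described, each feature row would hold an article's control variables together with its style measures, and the question is whether the style coefficients improve the fit beyond the controls alone.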
An exploratory analysis
Dr Boghrati had shown that style is important, but could the analysis also yield practical guidance for authors on how style makes a difference? She decided to carry out an exploratory analysis to answer the question.
In general, academic articles follow a standard structure:
Introduction and literature review
Methods and results
Discussion and conclusion.
Dr Boghrati noticed that each of the three segments may require a different writing style. In the first section, the author is talking about the topic, why it’s important and what previous research has shown. In the second part, they are describing what they did in the research and analyzing their results. In the third part, they are discussing the results, limitations and future directions.
Dr Boghrati separated out these parts of the full text using a mix of rule-based algorithms and machine learning models. Taking the same approach as for the full text, she analyzed each section of the articles separately. She also looked at a few language features: simplicity, personal voice and tense.
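The rule-based part of such a split might look like the sketch below. The article does not describe the actual rules or models used, so the heading keywords, the five-word heading limit and the function name `split_into_segments` are all assumptions made for illustration.

```python
import re

# Assumed heading keywords for each of the three segments. Real articles
# vary widely, so rules like these would miss many cases.
SEGMENT_PATTERNS = [
    ("front", re.compile(r"^(introduction|background|literature review)\b", re.I)),
    ("middle", re.compile(r"^(methods?|methodology|data|results|findings)\b", re.I)),
    ("end", re.compile(r"^(discussion|conclusions?|limitations)\b", re.I)),
]


def split_into_segments(text):
    """Assign each line to the segment of the most recent matching heading."""
    segments = {"front": [], "middle": [], "end": []}
    current = "front"  # text before any recognised heading counts as front
    for line in text.splitlines():
        stripped = line.strip()
        # Drop leading section numbering such as "2." or "3.1".
        heading = re.sub(r"^[\d.\s]+", "", stripped)
        # Treat only short lines as candidate headings.
        if len(heading.split()) <= 5:
            for name, pattern in SEGMENT_PATTERNS:
                if pattern.match(heading):
                    current = name
                    break
        segments[current].append(line)
    return {name: "\n".join(lines) for name, lines in segments.items()}


sample = (
    "1. Introduction\nWhy style matters.\n"
    "2. Methods\nWe counted words.\n"
    "3. Results\nCounts differ.\n"
    "4. Discussion\nStyle matters."
)
parts = split_into_segments(sample)
```

A splitter like this fails on unlabelled sections and on short body lines that happen to begin with a keyword.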
Simplicity: papers with simpler writing in the first section had higher citation counts. Simpler writing uses fewer articles and prepositions. For example, articles ask readers to distinguish between a single case and a class of something: the car versus a car. The papers with fewer articles and prepositions in the first section had higher citation counts. Dr Boghrati suggested why this might be: “Academic ideas are often pretty complex, so if we can communicate them in a simpler way at the beginning of the paper, it might attract the reader and increase the citation impact.”
Personal voice: papers using personal voice in the methods and results had lower citation counts. In academic writing, we are often advised against using personal pronouns such as I and we. Dr Boghrati’s results show this advice isn’t always effective; papers with personal pronouns in the first section actually had more citations. But in the middle section, the methods and results, more impersonal language is linked to higher citations. “If you say, ‘we show’ rather than ‘the results show,’ it might make it seem as though the results are driven by the author’s choices rather than being objective,” Dr Boghrati said.
Temporal perspective: papers using present tense had higher citation counts. When writing about research, it’s common practice to write in the past tense. But is this the most effective approach? Dr Boghrati used temporal language in general and auxiliary verbs more specifically to test this. Auxiliary verbs, such as be, do and will, can put writing in the past, present or future tense. The results showed that papers with more past-focused auxiliary verbs are cited less, while papers with more present-focused auxiliary verbs are cited more. “This might be because using present tense may suggest the content is more current, relevant, applicable and important,” Dr Boghrati said.
The value of research data
The corpus of articles Dr Boghrati and her colleagues used to produce these results came from two main sources: ICSR Lab and a humanities source. The ICSR Lab-sourced data was easier to use, according to Dr Boghrati: it was possible to separate out elements like the abstract and references easily, and it didn’t require much preparation. By comparison, the other data source needed cleaning up before analysis was possible. “In many cases the abstract or references were intertwined with footnotes, page numbers or author’s notes. It took us quite some time to write code to clean up the text, and it was still not 100 percent accurate.”
Citation counts were available through Scopus for the ICSR Lab data, but for the other source, Dr Boghrati had to go through the long manual process of collecting information via Google Scholar. There were also marked differences in the accessibility and usability of the data, and Dr Boghrati noted that certain aspects of the analysis were only possible with the ICSR Lab data from Elsevier. In particular, identifying authors’ institutions was only possible with the ICSR Lab data — data from a different source would have required a manual search.
If we didn't have access to ICSR Lab, we wouldn't have been able to do the analysis. When you have a big dataset, it’s just not feasible to search for every author on the internet and get their institution. So, it was very important for us to have the data from ICSR Lab.
Overall, the results provide valuable information that gives us cause to question whether certain standard practices are effective in academic writing, assuming that one of the goals is influence on the work of other academics, as reflected in citation rates. Dr Boghrati applied the findings when writing her own manuscript about this study. But she cautions against jumping to conclusions. “This is exploratory – while we show the effects in three controlled experiments, future work can look into other language features and domains.”
You can read about Dr Boghrati’s previous work on her Wharton Risk Center webpage.