"We live in the era of Dataism," said David Brooks years ago and, more recently, Yuval Noah Harari; both used the term to describe the Big Data Revolution. Yet data are created and used by people, and people sometimes handle or analyze data carelessly, whether intentionally or not.
Duke’s Office of Scientific Integrity invited the research community to reflect on the limitations of research data and what it means to “respect” data. The most recent research town hall, which filled the Penn Pavilion with more than 240 people engaged in research, focused on key data challenges and strategies for resolving them, as presented by three distinguished Duke faculty members: Dan Ariely (James B. Duke Professor of Psychology and Behavioral Economics), David Carlson (Assistant Professor of Civil and Environmental Engineering and of Biostatistics and Bioinformatics) and Steven Grambow (Assistant Professor of Biostatistics and Bioinformatics). The event was led by Larry Carin, Vice President for Research and James L. Meriam Professor of Electrical and Computer Engineering.
“To do science well and stay out of trouble, it is very important to understand the human limitations of using and misusing data. How do you handle a situation when you are 100% certain that your hypothesis is correct, but your data do not fully agree with it? Or when 97% of your data agree with your hypothesis and 3% do not?” With these questions, Dr. Carin introduced one of the biggest challenges to respecting research data: human bias.
In his delightful, engaging talk about research data and the danger of self-deception, Dan Ariely spoke about his research on dishonesty. “This is a story of how we make trade-offs between human values. We have human values, and sadly not all of them point in the same direction all the time. What do we do when those values do not fit? Which ones do we give up? We trade off other values against honesty, from answering ‘Honey, how do I look in this dress?’ to arriving late to dinner with friends and claiming we were stuck in traffic.”
Dan Ariely: “We need to celebrate all the results and not just the results that agree with our initial intuition”
To date, more than 50,000 people have been subjects in Ariely’s research studies, observed in experiments aimed at better understanding why and when people cheat. “It is not about fear of punishment or the amount of money that we pay people. It is all about the drive, which comes from conflict of interest, and our ability to bend reality and justify it in the way we want: wishful blindness,” said Professor Ariely. He learned that creative people tend to cheat more, “and this should make us feel worried,” because creative people also find creative ways to rationalize and justify their dishonesty “for the greater good.” Cheating also goes up significantly when people are invited to cheat by an individual with authority, or when they believe that by cheating they serve a greater good. Cheating for the greater good of science can, therefore, be an irresistible temptation.
Perhaps Ariely’s most enlightening example was his recollection of an experiment performed during his time at Harvard. The inclination of a study team to remove the responses of a particular participant who was pulling the data down, and therefore contradicted the research hypothesis, illustrates the temptation for researchers to remove outliers or values perceived as insignificant. “We need to celebrate all the results and not just the results that agree with our initial intuition,” concluded Ariely.
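To make that temptation concrete, here is a minimal, purely hypothetical sketch (the numbers are invented, not taken from Ariely's study) of how quietly excluding one inconvenient participant can shift a result:

```python
import numpy as np

# Hypothetical ratings from six participants; one low score "pulls the data down".
scores = np.array([7.1, 6.8, 7.4, 7.0, 6.9, 3.2])

mean_all = scores.mean()                   # honest analysis: keep every observation
mean_trimmed = scores[scores > 4].mean()   # tempting analysis: drop the "outlier"

print(f"mean with all data:   {mean_all:.2f}")      # 6.40
print(f"mean without outlier: {mean_trimmed:.2f}")  # 7.04
```

One deleted data point moves the average by more than half a point, which is exactly why any exclusion rule should be justified and declared before looking at the results.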
To protect ourselves from the danger of self-deception, Professor Ariely invited us to recognize any conflict of interest up front and to keep the ethics conversation alive. “Ethics is a little bit like a diet. You can’t follow it for only three days and then be OK. We need to continuously think and debate about it.”
Asked by someone in the audience whether harsher punishments can reduce dishonesty, Ariely responded: “States with the death penalty do not have lower crime rates. We need to alleviate the multiple pressures and be as close to a pressure-free environment as possible,” while keeping the ethical debate alive.
Steve Grambow: a better approach than data policing is “team science”
To those who see statisticians as research gatekeepers, Dr. Steve Grambow responded that there is a better way than data policing: team science. This approach is especially valuable nowadays, when “the research process is extremely complex and the complexity of science is often underappreciated.” Grambow focused his talk on Common Data Errors & Abuses throughout the Research Cycle. “Science is not broken, it got a lot harder,” he said, reminding the audience that research data is not something that exists “in itself”: data has context, and all the information about how it was collected, managed, and analyzed is crucial. “Teaching statistics out of context is wrong,” said Prof. Grambow, who advocates transparent, rigorous, and reproducible methods at every stage of the research cycle, from data collection to data analysis. In support of his argument, Grambow discussed John P. A. Ioannidis’s “viral” article “Why Most Published Research Findings Are False”[i], which has been viewed about 2.8 million times since its publication in 2005, and reminded us of the Nature article “Reproducibility: A Tragedy of Errors”[ii]. He ended his presentation with resources for scientists and statisticians engaged in clinical research on how best to report data:
- Harrington et al. “New Guidelines for Statistical Reporting in the Journal.” The New England Journal of Medicine 381(3): 285–286, July 18, 2019.
- Assel et al. “Guidelines for Reporting of Statistics for Clinical Research in Urology.” European Urology 75(3): 358–367, March 2019.
David Carlson: "People may often have good intuition for what statistical model to use, but we need to remember that there are rigorous statistical techniques to determine the best model"
Dr. David Carlson talked about the Nuts and Bolts of Respecting Data: Model Fitting and Misinterpretation, in the context of machine learning. He asked the audience an apparently intuitive question: do we always want to increase complexity to represent data? He talked about the risk of “overfitting,” defined as “the situation when the learned model increases complexity to fit the observed training data too well.” He showed the pros and cons of using a linear model versus quadratic, 4th-order, or 7th-order models, then asked the audience which model should be used; the general answer was that it’s best to use a simple model rather than a complex one. In the end, he concluded that even though people often have good intuition for what’s appropriate to use, we need to remember that there are rigorous statistical techniques to determine the best model.
He invited us to keep in mind three main points:
- Make sure that you are not over-using data or over-interpreting results
- While some complex patterns are real, most are not! Overfitting and over-interpretation hurt us moving forward
- Be respectful of your data
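One such rigorous technique is validation on held-out data. The sketch below is a hypothetical illustration (the synthetic dataset and candidate polynomial degrees are invented, not taken from Carlson's talk): instead of trusting intuition, we compare the linear, quadratic, 4th-order, and 7th-order models by their error on points they were never fit to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a truly linear trend plus noise.
x = rng.uniform(0, 1, size=40)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Hold out a quarter of the points for validation.
idx = rng.permutation(x.size)
train, test = idx[:30], idx[30:]

def heldout_mse(degree):
    """Fit a polynomial of the given degree on the training split,
    then measure mean squared error on the held-out split."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[test])
    return float(np.mean((pred - y[test]) ** 2))

errors = {d: heldout_mse(d) for d in (1, 2, 4, 7)}
best = min(errors, key=errors.get)
print("held-out MSE by degree:", errors)
print("selected degree:", best)
```

A high-order model can drive its training error toward zero by chasing the noise, but the held-out error exposes that, which is why validation (or full cross-validation) beats eyeballing a fit.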
[i] John P. A. Ioannidis. “Why Most Published Research Findings Are False.” PLOS Medicine 2(8): e124. (2005). https://doi.org/10.1371/journal.pmed.0020124
[ii] David B. Allison, Andrew W. Brown, Brandon J. George, and Kathryn A. Kaiser. “Reproducibility: A tragedy of errors.” Nature 530: 27–29 (2016). https://doi.org/10.1038/530027a