Question: What are the benefits and risks of unrestricted data use?
As the wave of machine learning techniques sweeps across nearly every field of science, data availability and quality have entered many debates. Reusing data saves time and money. Sharing data lets us learn from failed ventures while building upon successes. Furthermore, advances in machine learning unlock the potential of rich databases to yield new findings upon re-analysis. For example, deep learning applied to The Cancer Genome Atlas revealed new cancer-causing mutations, while the critical features of a class of semiconductors were predicted using the Novel Materials Discovery database.

However, unrestricted data reuse has its pitfalls. Giving credit where it is due is not only ethical but also important for quality control. Even now, plagiarism rates are nearly 25% in some US states, and a similar issue will likely surface in data reuse. Complications may arise when data is reused, whether by the original author in a new publication or by other scholars. These could stem from a lack of metadata, from force-fitting existing data to new hypotheses, or from invalid assumptions made without the bigger picture. van Raaji identified 18 problems associated with data reuse, some of which, such as contradicting previous conclusions drawn from the data without explanation, are seriously worrying. Such practices compromise the quality of research and mislead others, even if unintentionally.

Laboratories have much to gain from data reuse, but only if it is done methodically and under regulation. Published data must meet standards for reusability and carry licensing details that safeguard intellectual property. The FAIR principles (findability, accessibility, interoperability, reusability) are excellent guidelines for sharing data, but better structural support and incentives are needed to achieve this. With over 28 million public repositories of source code, GitHub encourages users to document code and authorship according to standard practices.
An equivalent host for data sharing could be established. While similar infrastructure exists (e.g. ToxFx for pharmacology), a cross-disciplinary, universally standardised database with well-documented data-handling procedures is lacking. As data grows beyond petabytes, traditional peer-review systems cannot keep up. A public reporting system is crucial to prevent the perpetuation of erroneous data: scientists who reuse data can simultaneously look out for red flags and report them, alerting others to mistakes and to suspicious or incomplete data. We also need a culture in which scientists who reuse data are responsive to clarification and debate should others detect potential misinterpretations.

Despite its sheer volume, existing data severely lacks diversity. Most of it was generated by well-funded, developed countries and hence targets their populations and problems. For example, genomic data on European populations is comprehensive, whereas data on ethnic minorities is sparse. Manrai et al. found that black Americans were misdiagnosed with hypertrophic cardiomyopathy more often than white Americans owing to severe under-representation in genomic data. Reusing existing data is certainly convenient, but it remains our responsibility to gather new data to correct this disparity.

Data informs most of mankind's decisions. For better decisions we need better science, and for that we need better data.