This article discusses Project FeederWatch, a real-time citizen science project, and how elementary teachers can use this bird data to integrate math lessons and concepts.
Data Carpentry Genomics workshop lesson to learn how to structure your metadata, organize and document your genomics data and bioinformatics workflow, and access data on the NCBI Sequence Read Archive (SRA) database.

Good data organization is the foundation of any research project. It not only sets you up well for an analysis, but also makes it easier to come back to the project later and share it with collaborators, including your most important collaborator - future you. Organizing a project that includes sequencing involves many components: the experimental setup and conditions metadata, measurements of experimental parameters, sequencing preparation and sample information, the sequences themselves, and the files and workflow of any bioinformatics analysis. So much of the information in a sequencing project is digital, and we need to keep track of our digital records the same way we keep a lab notebook and sample freezer. In this lesson, we'll go through the project organization and documentation that make an efficient bioinformatics workflow possible. Not only will this make you a more effective bioinformatics researcher, it also prepares your data and project for publication, as grant agencies and publishers increasingly require this information.

In this lesson, we'll be using data from a study of experimental evolution using E. coli. More information about this dataset is available here. In this study there are several types of files:

- Spreadsheet data from the experiment that tracks the strains and their phenotype over time
- Spreadsheet data with information on the samples that were sequenced - the names of the samples, how they were prepared, and the sequencing conditions
- The sequence data

Throughout the analysis, we'll also generate files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used.
In this lesson you will learn:

- How to structure your metadata, tabular data, and information about the experiment. The metadata is the information about the experiment and the samples you're sequencing.
- How to prepare for, understand, organize, and store the sequencing data that comes back from the sequencing center
- How to access and download publicly available data that may need to be used in your bioinformatics analysis
- The concepts of organizing the files and documenting the workflow of your bioinformatics analysis
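The project-organization advice above can be made concrete with a small script. This is only a sketch, assuming a hypothetical project name and directory layout in the spirit of the lesson: raw data kept separate and untouched, with a README recording provenance.

```python
# Hypothetical sketch of a documented project skeleton for a sequencing study.
# The directory names and project name are illustrative, not the lesson's exact layout.
from pathlib import Path
import tempfile

def scaffold_project(root: Path) -> list:
    """Create a documented skeleton: raw vs. processed data, results, docs, scripts."""
    subdirs = ["data/raw", "data/trimmed", "results", "docs", "scripts"]
    created = []
    for sub in subdirs:
        d = root / sub
        d.mkdir(parents=True, exist_ok=True)
        created.append(d)
    # Never edit raw data in place; record where it came from instead.
    (root / "data/raw/README.md").write_text(
        "Raw FASTQ files as delivered by the sequencing center. Do not modify.\n"
    )
    return created

root = Path(tempfile.mkdtemp()) / "ecoli_evolution"
dirs = scaffold_project(root)
print(len(dirs))  # 5 subdirectories created
```

Keeping this structure in a script (rather than creating folders by hand) documents the layout itself, which is part of the reproducible workflow the lesson aims for.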
The soup-to-nuts exercises take students through the entire process of research with statistical data, from the very beginning when they first access the original data, through cleaning and processing the data to prepare them for analysis, to the very end when they generate the results that they present in a written report. Throughout each exercise, there will be an emphasis on adopting a transparent workflow and constructing replication documentation that ensures all the work done for the exercise can be independently reproduced.
Background There is increasing interest to make primary data from published research publicly available. We aimed to assess the current status of making research data available in highly-cited journals across the scientific literature. Methods and Results We reviewed the first 10 original research papers of 2009 published in the 50 original research journals with the highest impact factor. For each journal we documented the policies related to public availability and sharing of data. Of the 50 journals, 44 (88%) had a statement in their instructions to authors related to public availability and sharing of data. However, there was wide variation in journal requirements, ranging from requiring the sharing of all primary data related to the research to just including a statement in the published manuscript that data can be available on request. Of the 500 assessed papers, 149 (30%) were not subject to any data availability policy. Of the remaining 351 papers that were covered by some data availability policy, 208 papers (59%) did not fully adhere to the data availability instructions of the journals they were published in, most commonly (73%) by not publicly depositing microarray data. The other 143 papers that adhered to the data availability instructions did so by publicly depositing only the specific data type as required, making a statement of willingness to share, or actually sharing all the primary data. Overall, only 47 papers (9%) deposited full primary raw data online. None of the 149 papers not subject to data availability policies made their full primary data publicly available. Conclusion A substantial proportion of original research papers published in high-impact journals are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals. This empiric evaluation highlights opportunities for improvement.
Policies that mandate public data archiving (PDA) successfully increase accessibility to data underlying scientific publications. However, is the data quality sufficient to allow reuse and reanalysis? We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse. We suggest that cultural shifts facilitating clearer benefits to authors are necessary to achieve high-quality PDA and highlight key guidelines to help authors increase their data’s reuse potential and compliance with journal data policies.
Background The p value obtained from a significance test provides no information about the magnitude or importance of the underlying phenomenon. Therefore, additional reporting of effect size is often recommended. Effect sizes are theoretically independent from sample size. Yet this may not hold true empirically: non-independence could indicate publication bias. Methods We investigate whether effect size is independent from sample size in psychological research. We randomly sampled 1,000 psychological articles from all areas of psychological research. We extracted p values, effect sizes, and sample sizes of all empirical papers, and calculated the correlation between effect size and sample size, and investigated the distribution of p values. Results We found a negative correlation of r = −.45 [95% CI: −.53; −.35] between effect size and sample size. In addition, we found an inordinately high number of p values just passing the boundary of significance. Additional data showed that neither implicit nor explicit power analysis could account for this pattern of findings. Conclusion The negative correlation between effect size and samples size, and the biased distribution of p values indicate pervasive publication bias in the entire field of psychology.
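The study's central check - whether effect size is independent of sample size - can be sketched as a correlation computed over per-study pairs. The numbers below are made-up toy values, not the paper's data; they only illustrate the pattern where small studies show larger effects, as expected under publication bias.

```python
# Illustrative sketch: correlate per-study effect sizes with sample sizes.
# The six "studies" here are hypothetical, chosen to show a negative trend.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sample_sizes = [20, 35, 50, 80, 120, 200]
effect_sizes = [0.80, 0.65, 0.45, 0.40, 0.30, 0.25]
r = pearson_r(sample_sizes, effect_sizes)
print(round(r, 2))  # clearly negative, echoing the paper's reported pattern
```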
P values represent a widely used, but pervasively misunderstood and fiercely contested method of scientific inference. Display items, such as figures and tables, often containing the main results, are an important source of P values. We conducted a survey comparing the overall use of P values and the occurrence of significant P values in display items of a sample of articles in the three top multidisciplinary journals (Nature, Science, PNAS) in 2017 and, respectively, in 1997. We also examined the reporting of multiplicity corrections and its potential influence on the proportion of statistically significant P values. Our findings demonstrated substantial and growing reliance on P values in display items, with increases of 2.5 to 14.5 times in 2017 compared to 1997. The overwhelming majority of P values (94%, 95% confidence interval [CI] 92% to 96%) were statistically significant. Methods to adjust for multiplicity were almost non-existent in 1997, but reported in many articles relying on P values in 2017 (Nature 68%, Science 48%, PNAS 38%). In their absence, almost all reported P values were statistically significant (98%, 95% CI 96% to 99%). Conversely, when any multiplicity corrections were described, 88% (95% CI 82% to 93%) of reported P values were statistically significant. Use of Bayesian methods was scant (2.5%) and rarely (0.7%) articles relied exclusively on Bayesian statistics. Overall, wider appreciation of the need for multiplicity corrections is a welcome evolution, but the rapid growth of reliance on P values and implausibly high rates of reported statistical significance are worrisome.
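The multiplicity corrections the survey tracks can be illustrated with the simplest of them, the Bonferroni adjustment (one of several methods the surveyed articles might use; the raw p values below are hypothetical).

```python
# Minimal sketch of family-wise error control via Bonferroni adjustment:
# multiply each raw p value by the number of comparisons, cap at 1.0.
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p values, significance flags) for m comparisons."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant

# Five hypothetical raw p values from a single display item.
raw = [0.001, 0.012, 0.03, 0.04, 0.20]
adj, sig = bonferroni(raw)
print(sig)  # only the smallest raw p value survives the correction
```

This is exactly the effect the survey observes in aggregate: once any multiplicity correction is applied, the proportion of results counted as statistically significant drops.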
Python is a general purpose programming language that is useful for writing scripts to work effectively and reproducibly with data. This is an introduction to Python designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about Python syntax, the Jupyter notebook interface, and move through how to import CSV files, using the pandas package to work with data frames, how to calculate summary information from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from Python.
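The core of the pandas portion of these lessons can be sketched in a few lines: build (or, in the lessons, load with pd.read_csv) a data frame, then compute summary information overall and per group. The column names and values here are hypothetical, not the lessons' actual dataset.

```python
# Sketch of the lessons' pandas workflow: a data frame, a summary statistic,
# and a grouped aggregation. (In the lessons you would load a CSV instead,
# e.g. df = pd.read_csv("surveys.csv").)
import pandas as pd

df = pd.DataFrame({
    "species": ["DM", "DM", "PF", "PF", "DS"],
    "weight":  [42.0, 45.0, 7.5, 8.0, 120.0],
})

print(df["weight"].mean())                      # overall mean weight
print(df.groupby("species")["weight"].mean())   # mean weight per species
```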
Discussions of how to improve research quality are predominant in a number of fields, including education. But how prevalent are the use of problematic practices and the improved practices meant to counter them? This baseline information will be a critical data source as education researchers seek to improve our research practices. In this preregistered study, we replicated and extended previous studies from other fields by asking education researchers about 10 questionable research practices and 5 open research practices. We asked them to estimate the prevalence of the practices in the field, self-report their own use of such practices, and estimate the appropriateness of these behaviors in education research. We made predictions under four umbrella categories: comparison to psychology, geographic location, career stage, and quantitative orientation. Broadly, our results suggest that both questionable and open research practices are part of the typical research practices of many educational researchers. Preregistration, code, and data can be found at https://osf.io/83mwk/.
We surveyed 807 researchers (494 ecologists and 313 evolutionary biologists) about their use of Questionable Research Practices (QRPs), including cherry picking statistically significant results, p hacking, and hypothesising after the results are known (HARKing). We also asked them to estimate the proportion of their colleagues that use each of these QRPs. Several of the QRPs were prevalent within the ecology and evolution research community. Across the two groups, we found 64% of surveyed researchers reported they had at least once failed to report results because they were not statistically significant (cherry picking); 42% had collected more data after inspecting whether results were statistically significant (a form of p hacking) and 51% had reported an unexpected finding as though it had been hypothesised from the start (HARKing). Such practices have been directly implicated in the low rates of reproducible results uncovered by recent large scale replication studies in psychology and other disciplines. The rates of QRPs found in this study are comparable with the rates seen in psychology, indicating that the reproducibility problems discovered in psychology are also likely to be present in ecology and evolution.
Hypothesizing after the results are known (HARK) has been disparaged as data dredging, and safeguards including hypothesis preregistration and statistically rigorous oversight have been recommended. Despite potential drawbacks, HARK has deepened thinking about complex causal processes. Some of the HARK precautions can conflict with the modern reality of researchers’ obligations to use big, ‘organic’ data sources—from high-throughput genomics to social media streams. We here propose a HARK-solid, reproducible inference framework suitable for big data, based on models that represent formalization of hypotheses. Reproducibility is attained by employing two levels of model validation: internal (relative to data collated around hypotheses) and external (independent to the hypotheses used to generate data or to the data used to generate hypotheses). With a model-centered paradigm, the reproducibility focus changes from the ability of others to reproduce both data and specific inferences from a study to the ability to evaluate models as representation of reality. Validation underpins ‘natural selection’ in a knowledge base maintained by the scientific community. The community itself is thereby supported to be more productive in generating and critically evaluating theories that integrate wider, complex systems.
Student teams assign importance factors, called "desirability points," to the rock properties found in the previous lesson/activity in order to mathematically determine the overall best rocks for building caverns within. They learn the real-world connections and relationships between rocks and the engineering properties that matter for designing and building caverns (or tunnels, mines, building foundations, etc.).
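The scoring scheme described above amounts to a weighted sum: each rock property rating is multiplied by its team-assigned desirability weight, and the rock with the highest total wins. The property names, weights, and ratings below are hypothetical, not the activity's actual values.

```python
# Sketch of "desirability points" scoring: weighted sum of property ratings.
# All names and numbers are illustrative stand-ins for the teams' own choices.
def desirability_score(properties, weights):
    return sum(properties[name] * weights[name] for name in weights)

weights = {"hardness": 3, "low_porosity": 2, "strength": 5}   # importance factors
rocks = {
    "granite":   {"hardness": 9, "low_porosity": 8, "strength": 9},
    "sandstone": {"hardness": 4, "low_porosity": 3, "strength": 4},
    "limestone": {"hardness": 5, "low_porosity": 5, "strength": 6},
}

scores = {rock: desirability_score(props, weights) for rock, props in rocks.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # granite 88
```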
Students will learn about the water cycle, watersheds, and point and non-point source pollution. Students will then apply this knowledge to take a position in the debate about the proposed development at Hawn's Bridge Peninsula at Raystown Lake and write a letter to the editor expressing their opinion. Pairs well with an Engineering Design Challenge or a Meaningful Watershed Educational Experience (MWEE).
The recent ‘replication crisis’ in psychology has focused attention on ways of increasing methodological rigor within the behavioral sciences. Part of this work has involved promoting ‘Registered Reports’, wherein journals peer review papers prior to data collection and publication. Although this approach is usually seen as a relatively recent development, we note that a prototype of this publishing model was initiated in the mid-1970s by parapsychologist Martin Johnson in the European Journal of Parapsychology (EJP). A retrospective and observational comparison of Registered and non-Registered Reports published in the EJP during a seventeen-year period provides circumstantial evidence to suggest that the approach helped to reduce questionable research practices. This paper aims both to bring Johnson’s pioneering work to a wider audience, and to investigate the positive role that Registered Reports may play in helping to promote higher methodological and statistical standards.
Students become familiar with the online Renewable Energy Living Lab interface and access its real-world solar energy data to evaluate the potential for solar generation in various U.S. locations. They become familiar with where the most common sources of renewable energy are distributed across the U.S. Through this activity, students and teachers gain familiarity with the living lab's GIS graphic interface and query functions, and are exposed to the available data in renewable energy databases, learning how to query to find specific information for specific purposes. The activity is intended as a "training" activity prior to conducting activities such as The Bright Idea activity, which includes a definitive and extensive end product (a feasibility plan) for students to create.
Students use real-world data to evaluate the feasibility of solar energy and other renewable energy sources in different U.S. locations. Working in small groups, students act as engineers evaluating the suitability of installing solar panels at four company locations. They access data from the online Renewable Energy Living Lab from which they make calculations and analyze how successful solar energy generation would be, as well as the potential for other power sources at those locations. Then they summarize their results, analysis and recommendations in the form of feasibility plans prepared for a CEO.
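The feasibility calculation students perform can be sketched with a simplified back-of-the-envelope model (not the living lab's actual method): estimated annual output from panel area, local insolation, and panel efficiency. The site values below are hypothetical.

```python
# Simplified solar feasibility model: annual energy from area x insolation x
# efficiency. Inputs are illustrative, not Renewable Energy Living Lab data.
def annual_output_kwh(area_m2, insolation_kwh_m2_day, efficiency, days=365):
    """Estimated yearly energy (kWh) from a solar array."""
    return area_m2 * insolation_kwh_m2_day * efficiency * days

# Hypothetical comparison of two company sites with the same 100 m2 array.
phoenix = annual_output_kwh(area_m2=100, insolation_kwh_m2_day=6.5, efficiency=0.18)
seattle = annual_output_kwh(area_m2=100, insolation_kwh_m2_day=3.7, efficiency=0.18)
print(round(phoenix), round(seattle))  # the sunnier site yields far more energy
```

A feasibility plan would then compare these estimates against each site's energy demand and the potential of other local power sources.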
This course was developed and taught by Ben Marwick, Professor of Archaeology at the University of Washington. A requirement for the UW Master of Science in Data Science, it introduces students to the principles and tools for computational reproducibility in data science using R. Topics covered include acquiring, cleaning, and manipulating data in a reproducible workflow using the tidyverse. Students will use literate programming tools and explore best practices for organizing data analyses. Students will learn to write documents using R Markdown, compile R Markdown documents using knitr and related tools, and publish reproducible documents to various common formats. Students will learn strategies and tools for packaging research compendia, dependency management, and containerising projects to provide computational isolation.
No restrictions on your remixing, redistributing, or making derivative works. Give credit to the author, as required.
Your remixing, redistributing, or making derivative works comes with some restrictions, including how it is shared.
Your redistributing comes with some restrictions. Do not remix or make derivative works.
Most restrictive license type. Prohibits most uses, sharing, and any changes.
Copyrighted materials, available under Fair Use and the TEACH Act for US-based educators, or other custom arrangements. Go to the resource provider to see their individual restrictions.