OpenRefine for Social Science Data

Lesson on OpenRefine for social scientists. A part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting made consistent. This step must be taken with the same care and attention to reproducibility as the analysis. OpenRefine (formerly Google Refine) is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another. This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people comment that this tool saves them literally months of work trying to make these edits by hand.

Material Type: Module

Authors: Erin Becker, François Michonneau, Geoff LaFlair, Karen Word, Lachlan Deer, Peter Smyth, Tracy Teal

Data Organization in Spreadsheets for Social Scientists

Lesson on spreadsheets for social scientists. Good data organization is the foundation of any research project. Most researchers have data in spreadsheets, so it’s the place that many research projects start. Typically we organize data in spreadsheets in ways that we as humans want to work with the data. However, computers require data to be organized in particular ways. In order to use tools that make computation more efficient, such as programming languages like R or Python, we need to structure our data the way that computers need the data. Since this is where most research projects start, this is where we want to start too! In this lesson, you will learn: good data entry practices (formatting data tables in spreadsheets); how to avoid common formatting mistakes; approaches for handling dates in spreadsheets; basic quality control and data manipulation in spreadsheets; and exporting data from spreadsheets. In this lesson, however, you will not learn about data analysis with spreadsheets. Much of your time as a researcher will be spent in the initial ‘data wrangling’ stage, where you need to organize the data to perform a proper analysis later. It’s not the most fun, but it is necessary. In this lesson you will learn how to think about data organization and some practices for more effective data wrangling. With this approach you can better format current data and plan new data collection so less data wrangling is needed.

Material Type: Module

Authors: David Mawdsley, Erin Becker, François Michonneau, Karen Word, Lachlan Deer, Peter Smyth

The Unix Shell (La Terminal de Unix)

Software Carpentry lesson on the Unix shell, taught in Spanish. The Unix shell has existed longer than most of its users. It has survived so long because it is a powerful tool that lets people do complex things with just a few keystrokes. Most importantly, it helps users combine existing programs in new ways and automate repetitive tasks, rather than typing the same things over and over. Using the terminal, or shell, is fundamental to using many other powerful tools and computing resources (including supercomputers, or “high-performance computing”). This lesson will start you on the path toward using these resources effectively.

Material Type: Module

Authors: Adam Huffman, Alejandra Gonzalez-Beltran, AnaBVA, Andrew Sanchez, Anja Le Blanc, Ashwin Srinath, Brian Ballsun-Stanton, Colin Morris, csqrs, Dani Ledezma, Dave Bridges, Erin Becker, Francisco Palm, François Michonneau, Gabriel A. Devenyi, Gerard Capes, Giuseppe Profiti, Gordon Rhea, Jake Cowper Szamosi, Jared Flater, Jeff Oliver, Jonah Duckles, Juan M. Barrios, Katrin Leinweber, Kelly L. Rowland, Kevin Alquicira, Kunal Marwaha, LauCIFASIS, Marisa Lim, Martha Robinson, Matias Andina, Michael Zingale, Nicolas Barral, Nohemi Huanca Nunez, Olemis Lang, Otoniel Maya, Paula Andrea Martinez, Raniere Silva, Rayna M Harris, Shirley Alquicira, Silvana Pereyra, sjnair, Stéphane Guillou, Steve Leak, Thomas Mellan, Veronica Jimenez-Jacinto, William L. Close, Yee Mey

Version Control with Git (El Control de Versiones con Git)

Software Carpentry lesson on version control with Git, taught in Spanish. To illustrate the power of Git and GitHub, we will use the following story as a motivating example throughout this lesson. Wolfman and Dracula have been hired by Universal Missions to investigate whether it is possible to send their next planetary explorer to Mars. They want to be able to work on the plans at the same time, but they have run into problems doing something similar in the past. If they take turns, each one will spend a lot of time waiting for the other to finish, but if they work on their own copies and exchange changes by email, things will be lost, overwritten, or duplicated. A colleague suggests using version control to manage the work. Version control is better than exchanging files by email. Nothing is lost once it is placed under version control, unless someone makes a substantial effort to lose it. Because all earlier versions of the files are saved, it is always possible to go back in time and see exactly who wrote what on a particular day, or which version of a program was used to generate a particular set of results. Because there is a record of who did what and when, you know whom to ask if a question comes up later and, if necessary, you can revert the content to an earlier version, much like the “undo” command in a text editor. When several people collaborate on the same project, it is possible to accidentally overlook or overwrite changes made by someone else. The version control system automatically notifies users whenever there is a conflict between one person’s work and another’s. Teams are not the only ones to benefit from version control: independent researchers can benefit greatly as well. Keeping a record of what changed, when, and why is extremely useful for all researchers if they ever need to return to the project later (e.g., a year later, when the memory of the details has faded).

Material Type: Module

Authors: Alejandra Gonzalez-Beltran, Amy Olex, Belinda Weaver, Bradford Condon, butterflyskip, Casey Youngflesh, Daisie Huang, Dani Ledezma, dounia, Francisco Palm, Garrett Bachant, Heather Nunn, Hely Salgado, Ian Lee, Ivan Gonzalez, James E McClure, Javier Forment, Jimmy O'Donnell, Jonah Duckles, Katherine Koziar, Katrin Leinweber, K.E. Koziar, Kevin Alquicira, Kevin MF, Kurt Glaesemann, LauCIFASIS, Leticia Vega, Lex Nederbragt, Mark Woodbridge, Matias Andina, Matt Critchlow, Mingsheng Zhang, Nelly Sélem, Nima Hejazi, Nohemi Huanca Nunez, Olemis Lang, Paula Andrea Martinez, Peace Ossom Williamson, P. L. Lim, Rayna M Harris, Romualdo Zayas-Lagunas, Sarah Stevens, Saskia Hiltemann, Shirley Alquicira, Silvana Pereyra, Tom Morrell, Valentina Bonetti, Veronica Ikeshoji-Orlati, Veronica Jimenez

Data Organization in Spreadsheets for Ecologists

Good data organization is the foundation of any research project. Most researchers have data in spreadsheets, so it’s the place that many research projects start. We organize data in spreadsheets in the ways that we as humans want to work with the data, but computers require that data be organized in particular ways. In order to use tools that make computation more efficient, such as programming languages like R or Python, we need to structure our data the way that computers need the data. Since this is where most research projects start, this is where we want to start too! In this lesson, you will learn: good data entry practices (formatting data tables in spreadsheets); how to avoid common formatting mistakes; approaches for handling dates in spreadsheets; basic quality control and data manipulation in spreadsheets; and exporting data from spreadsheets. In this lesson, however, you will not learn about data analysis with spreadsheets. Much of your time as a researcher will be spent in the initial ‘data wrangling’ stage, where you need to organize the data to perform a proper analysis later. It’s not the most fun, but it is necessary. In this lesson you will learn how to think about data organization and some practices for more effective data wrangling. With this approach you can better format current data and plan new data collection so less data wrangling is needed.

Material Type: Module

Authors: Christie Bahlai, Peter R. Hoyt, Tracy Teal

Data Cleaning with OpenRefine for Ecologists

A part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting made consistent. This step must be taken with the same care and attention to reproducibility as the analysis. OpenRefine (formerly Google Refine) is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another. This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people comment that this tool saves them literally months of work trying to make these edits by hand.

Material Type: Module

Authors: Cam Macdonell, Deborah Paul, Phillip Doehle, Rachel Lombardi

Data Management with SQL for Ecologists

Databases are useful for both storing and using data effectively. Using a relational database serves several purposes. It keeps your data separate from your analysis. This means there’s no risk of accidentally changing data when you analyze it. If we get new data we can rerun a query to find all the data that meets certain criteria. It’s fast, even for large amounts of data. It improves quality control of data entry (type constraints and use of forms in Access, FileMaker, etc.). The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python. This lesson will teach you what relational databases are, how you can load data into them and how you can query databases to extract just the information that you need.
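
As a rough illustration of two of the points above (rerunning a query when new data arrives, and type constraints guarding data entry), here is a minimal sketch using Python's built-in sqlite3 module. The table name `surveys`, its columns, and the sample values are hypothetical and are not taken from the lesson.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table: the CHECK constraint rejects impossible weights at entry time.
conn.execute(
    "CREATE TABLE surveys (species TEXT NOT NULL, weight REAL CHECK (weight > 0))"
)
conn.executemany(
    "INSERT INTO surveys VALUES (?, ?)",
    [("DM", 40.0), ("PF", 7.0), ("DM", 48.0)],
)

# The query is stored once and can be rerun whenever the data changes.
QUERY = "SELECT species, weight FROM surveys WHERE weight > ? ORDER BY weight"


def heavy_records(connection, min_weight):
    """Return every (species, weight) row heavier than min_weight."""
    return connection.execute(QUERY, (min_weight,)).fetchall()


before = heavy_records(conn, 30)

# New data arrives; the same query picks it up without any other changes.
conn.execute("INSERT INTO surveys VALUES (?, ?)", ("NL", 160.0))
after = heavy_records(conn, 30)

# A bad entry is refused by the constraint instead of silently stored.
try:
    conn.execute("INSERT INTO surveys VALUES (?, ?)", ("DM", -5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(before, after, rejected)
```

The data stays in the database while the analysis lives in the query, so rerunning `heavy_records` never modifies the stored rows.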

Material Type: Module

Authors: Christina Koch, Donal Heidenblad, Katy Felkner, Rémi Rampin, Timothée Poisot

Data Analysis and Visualization in R for Ecologists

Data Carpentry lesson from Ecology curriculum to learn how to analyse and visualise ecological data in R. Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain. The lessons below were designed for those interested in working with ecology data in R. This is an introduction to R designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about R syntax, the RStudio interface, and move through how to import CSV files, the structure of data frames, how to deal with factors, how to add/remove rows and columns, how to calculate summary statistics from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from R.

Material Type: Module

Authors: Ankenbrand, Markus, Arindam Basu, Ashander, Jaime, Bahlai, Christie, Bailey, Alistair, Becker, Erin Alison, Bledsoe, Ellen, Boehm, Fred, Bolker, Ben, Bouquin, Daina, Burge, Olivia Rata, Burle, Marie-Helene, Carchedi, Nick, Chatzidimitriou, Kyriakos, Chiapello, Marco, Conrado, Ana Costa, Cortijo, Sandra, Cranston, Karen, Cuesta, Sergio Martínez, Culshaw-Maurer, Michael, Czapanskiy, Max, Daijiang Li, Dashnow, Harriet, Daskalova, Gergana, Deer, Lachlan, Direk, Kenan, Dunic, Jillian, Elahi, Robin, Fishman, Dmytro, Fouilloux, Anne, Fournier, Auriel, Gan, Emilia, Goswami, Shubhang, Guillou, Stéphane, Hancock, Stacey, Hardenberg, Achaz Von, Harrison, Paul, Hart, Ted, Herr, Joshua R., Hertweck, Kate, Hodges, Toby, Hulshof, Catherine, Humburg, Peter, Jean, Martin, Johnson, Carolina, Johnson, Kayla, Johnston, Myfanwy, Jordan, Kari L, K. A. S. Mislan, Kaupp, Jake, Keane, Jonathan, Kerchner, Dan, Klinges, David, Koontz, Michael, Leinweber, Katrin, Lepore, Mauro Luciano, Lijnzaad, Philip, Li, Ye, Lotterhos, Katie, Mannheimer, Sara, Marwick, Ben, Michonneau, François, Millar, Justin, Moreno, Melissa, Najko Jahn, Obeng, Adam, Odom, Gabriel J., Pauloo, Richard, Pawlik, Aleksandra Natalia, Pearse, Will, Peck, Kayla, Pederson, Steve, Peek, Ryan, Pletzer, Alex, Quinn, Danielle, Rajeg, Gede Primahadi Wijaya, Reiter, Taylor, Rodriguez-Sanchez, Francisco, Sandmann, Thomas, Seok, Brian, Sfn_brt, Shiklomanov, Alexey, Shivshankar Umashankar, Stachelek, Joseph, Strauss, Eli, Sumedh, Switzer, Callin, Tarkowski, Leszek, Tavares, Hugo, Teal, Tracy, Theobold, Allison, Tirok, Katrin, Tylén, Kristian, Vanichkina, Darya, Voter, Carolyn, Webster, Tara, Weisner, Michael, White, Ethan P, Wilson, Earle, Woo, Kara, Wright, April, Yanco, Scott, Ye, Hao

Data Analysis and Visualization in Python for Ecologists

Python is a general purpose programming language that is useful for writing scripts to work effectively and reproducibly with data. This is an introduction to Python designed for participants with no programming experience. These lessons can be taught in one and a half days (~ 10 hours). They start with some basic information about Python syntax, the Jupyter notebook interface, and move through how to import CSV files, using the pandas package to work with data frames, how to calculate summary information from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from Python.
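
To give a flavor of the "import a CSV file, then calculate summary information" workflow described above, here is a minimal sketch. The lesson itself uses the pandas package; this sketch swaps in Python's standard csv module so that it runs without extra installation. The column names and values are hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical data; in practice this would be read from a CSV file on disk.
CSV_TEXT = """species,weight
DM,40
DM,48
PF,7
PF,9
"""


def mean_weight_by_species(csv_text):
    """Group rows by species and return the mean weight for each group."""
    weights = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        weights[row["species"]].append(float(row["weight"]))
    return {species: mean(values) for species, values in weights.items()}


summary = mean_weight_by_species(CSV_TEXT)
print(summary)
```

With pandas, the same grouping and averaging is typically expressed in a single chained call on a data frame, which is part of what the lesson teaches.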

Material Type: Module

Authors: Maxim Belkin, Tania Allard

Intro to R and RStudio for Genomics

Welcome to R! Working with a programming language (especially if it’s your first time) often feels intimidating, but the rewards outweigh any frustrations. An important secret of coding is that even experienced programmers find it difficult and frustrating at times – so if even the best feel that way, why let intimidation stop you? Given time and practice* you will soon find it easier and easier to accomplish what you want. Why learn to code? Bioinformatics – like biology – is messy. Different organisms, different systems, different conditions, all behave differently. Experiments at the bench require a variety of approaches – from tested protocols to trial-and-error. Bioinformatics is also an experimental science, otherwise we could use the same software and same parameters for every genome assembly. Learning to code opens up the full possibilities of computing, especially given that most bioinformatics tools exist only at the command line. Think of it this way: if you could only do molecular biology using a kit, you could probably accomplish a fair amount. However, if you don’t understand the biochemistry of the kit, how would you troubleshoot? How would you do experiments for which there are no kits? R is one of the most widely-used and powerful programming languages in bioinformatics. R especially shines where a variety of statistical tools are required (e.g. RNA-Seq, population genomics, etc.) and in the generation of publication-quality graphs and figures. Rather than get into an R vs. Python debate (both are useful), keep in mind that many of the concepts you will learn apply to Python and other programming languages. Finally, we won’t lie; R is not the easiest-to-learn programming language ever created. So, don’t get discouraged! The truth is that even with the modest amount of R we will cover today, you can start using some sophisticated R software packages, and have a general sense of how to interpret an R script. 
Get through these lessons, and you are on your way to being an accomplished R user! * We very intentionally used the word practice. One of the other “secrets” of programming is that you can only learn so much by reading about it. Do the exercises in class, re-do them on your own, and then work on your own problems.

Material Type: Module

Authors: Ahmed Moustafa, Alexia Cardona, Andrea Ortiz, Jason Williams, Krzysztof Poterlowicz, Naupaka Zimmerman, Yuka Takemon

Data Analysis and Visualization with Python for Social Scientists

Python is a general purpose programming language that is useful for writing scripts to work effectively and reproducibly with data. This is an introduction to Python designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about Python syntax, the Jupyter notebook interface, and move through how to import CSV files, using the pandas package to work with data frames, how to calculate summary information from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from Python.

Material Type: Module

Authors: Geoffrey Boushey, Stephen Childs

Data Management with SQL for Social Scientists

This is an alpha lesson to teach Data Management with SQL for Social Scientists. We welcome any criticism or error reports, and will take your feedback into account to improve both the presentation and the content. Databases are useful for both storing and using data effectively. Using a relational database serves several purposes. It keeps your data separate from your analysis. This means there’s no risk of accidentally changing data when you analyze it. If we get new data we can rerun a query to find all the data that meets certain criteria. It’s fast, even for large amounts of data. It improves quality control of data entry (type constraints and use of forms in Access, FileMaker, etc.). The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python. This lesson will teach you what relational databases are, how you can load data into them and how you can query databases to extract just the information that you need.

Material Type: Module

Author: Peter Smyth

Geospatial Workshop Overview

Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain. Interested in teaching these materials? We have an onboarding video available to prepare Instructors to teach these lessons. After watching this video, please contact team@carpentries.org so that we can record your status as an onboarded Instructor. Instructors who have completed onboarding will be given priority status for teaching at centrally-organized Data Carpentry Geospatial workshops.

Material Type: Module

Authors: Anne Fouilloux, Arthur Endsley, Chris Prener, Jeff Hollister, Joseph Stachelek, Leah Wasser, Michael Sumner, Michele Tobias, Stace Maples

Image Processing with Python

This lesson shows how to use Python and skimage to do basic image processing. With support from an NSF iUSE grant, Dr. Tessa Durham Brooks and Dr. Mark Meysenburg at Doane College, Nebraska, USA have developed a curriculum for teaching image processing in Python. This lesson is currently being piloted at different institutions. This pilot phase will be followed by a clean-up phase to incorporate suggestions and feedback from the pilots into the lessons and to make the lessons teachable by the broader community. Development for these lessons has been supported by a grant from the Sloan Foundation.

Material Type: Module

Author: Mark Meysenburg

Introduction to the Command Line for Economics

A command line interface (OS shell) and a graphical user interface (GUI) are different ways of interacting with a computer’s operating system. The shell is a program that presents a command line interface, which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination. There are quite a few reasons to start learning about the shell. The shell gives you power: the command line lets you do your work more efficiently and more quickly. When you need to do things tens to hundreds of times, knowing how to use the shell is transformative. To use remote computers or cloud computing, you need to use the shell.

Material Type: Module

Authors: Andras Vereckei, Arieda Muço, Miklós Koren

Economics Lesson with Stata

A Data Carpentry curriculum for Economics is being developed by Dr. Miklos Koren at Central European University. These materials are being piloted locally. Development for these lessons has been supported by a grant from the Sloan Foundation.

Material Type: Module

Authors: Andras Vereckei, Arieda Muço, Miklós Koren

Data Carpentry for Biologists

The Biology Semester-long Course was developed and piloted at the University of Florida in Fall 2015. Course materials include readings, lectures, exercises, and assignments that expand on the material presented at workshops focusing on SQL and R.

Material Type: Module

Authors: Ethan White, Zachary Brym

Programming with R

The best way to learn how to program is to do something useful, so this introduction to R is built around a common scientific task: data analysis. Our real goal isn’t to teach you R, but to teach you the basic concepts that all programming depends on. We use R in our lessons because: we have to use something for examples; it’s free, well-documented, and runs almost everywhere; it has a large (and growing) user base among scientists; and it has a large library of external packages available for performing diverse tasks. But the two most important things are to use whatever language your colleagues are using, so you can share your work with them easily, and to use that language well.

We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets of their daily inflammation. The data sets are stored in CSV format (comma-separated values): each row holds information for a single patient, and the columns represent successive days. The first few rows of our first file look like this:

0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1

We want to: load that data into memory, calculate the average inflammation per day across all patients, and plot the result. To do all that, we’ll have to learn a little bit about programming.
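
The first two goals (load the data, then average each day across patients) can be sketched in a few lines. The lesson itself teaches this in R; the sketch below uses Python with only the standard library, embeds the five rows shown above instead of reading a file, and leaves out the plotting step. The function name `daily_average` is made up for this illustration.

```python
import csv
import io

# The five rows quoted in the description; one row per patient, one column per day.
DATA = """\
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
"""


def daily_average(csv_text):
    """Return the mean inflammation for each day, averaged over all patients."""
    patients = [[int(value) for value in row]
                for row in csv.reader(io.StringIO(csv_text)) if row]
    # zip(*patients) turns rows (patients) into columns (days).
    return [sum(day) / len(day) for day in zip(*patients)]


daily_means = daily_average(DATA)
print(daily_means[:3])
```

Transposing with `zip` keeps the sketch dependency-free; a numerical library would normally compute column means directly.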

Material Type: Module

Authors: Diya Das, Katrin Leinweber, Rohit Goswami

Programming with MATLAB

The best way to learn how to program is to do something useful, so this introduction to MATLAB is built around a common scientific task: data analysis. Our real goal isn’t to teach you MATLAB, but to teach you the basic concepts that all programming depends on. We use MATLAB in our lessons because: we have to use something for examples; it’s well-documented; it has a large (and growing) user base among scientists in academia and industry; and it has a large library of packages available for performing diverse tasks. But the two most important things are to use whatever language your colleagues are using, so that you can share your work with them easily, and to use that language well.

Material Type: Module

Author: Gerard Capes

Python for Humanities

Python is a general purpose programming language that is useful for writing scripts to work effectively and reproducibly with data. This is an introduction to Python designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about Python syntax, the Jupyter notebook interface, and move through how to import CSV files, using the pandas package to work with data frames, how to calculate summary information from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from Python.

Material Type: Module

Author: Iain Emsley
