This lesson is being piloted (Beta version)

R for Research: Glossary

Key Points

R Project Setup
  • Use RStudio to write and run R programs.

  • Use Git to track changes in a project.

Research Project Organisation
  • Build and maintain a project directory that is easy to navigate.

  • Understand when and how to

Introduction to Spreadsheets
  • Organizing your data tables according to tidy data principles will make them easier for you and others to use for analysis.

Formatting data tables in Spreadsheets
  • Never modify your raw data. Always make a copy before making any changes.

  • Keep track of all of the steps you take to clean your data.

  • Organize your data according to tidy data principles.

  • Record metadata in a separate plain text file.

Formatting problems
  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Don’t use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

Dates as data
  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.
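
A minimal sketch in R of splitting dates into their component values. It assumes the dates are stored as unambiguous YYYY-MM-DD text; the example values and the name date_parts are hypothetical.

```r
# Hypothetical example dates, stored as ISO 8601 text (YYYY-MM-DD).
dates <- as.Date(c("2015-07-23", "2016-01-05"))

# Split each date into separate integer columns.
date_parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)

print(date_parts)
```

Storing year, month and day as plain integers avoids relying on how a particular spreadsheet program interprets a date cell.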

Quality assurance
  • Always copy your original spreadsheet file and work with a copy so you don’t affect the raw data.

  • Use data validation to prevent accidentally entering invalid data.

Exporting data
  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.

Introduction to OpenRefine
  • OpenRefine is a powerful, free and open source tool that can be used for data cleaning.

  • OpenRefine automatically tracks every step, allowing you to backtrack as needed and providing a record of all work done.

Working with OpenRefine
  • OpenRefine can import a variety of file types.

  • OpenRefine can be used to explore data using filters.

  • Clustering in OpenRefine can help to identify different values that might mean the same thing.

  • OpenRefine can transform the values of a column.

Filtering and Sorting with OpenRefine
  • OpenRefine provides a way to sort and filter data without affecting the raw data.

Examining Numbers in OpenRefine
  • OpenRefine also provides ways to get overviews of numerical data.

Using scripts
  • OpenRefine tracks all changes, and this record can be used as a script for future analyses or to reproduce an analysis.

Exporting and Saving Data from OpenRefine
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Other Resources in OpenRefine
  • Other examples and resources available online are useful for learning more about OpenRefine.

Hello World - interacting with R
  • Use RStudio to write and run R programs.

  • Use install.packages() to install packages (libraries).
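
One possible way to install a package only when it is missing is sketched below. The helper name ensure_package is hypothetical and not part of R; install.packages() and requireNamespace() are standard functions.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  # Report whether the package can now be loaded.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call returns TRUE without installing anything.
ensure_package("stats")
```

For an add-on package the pattern would be ensure_package("dplyr") followed by library(dplyr). A package is installed once but must be loaded with library() in every new R session.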

Starting with Coding
  • Access individual values by location using [].

  • Access slices of data using [low:high].

  • Access arbitrary sets of data using [c(...)].

  • Use logical operations and logical vectors to access subsets of data.
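
The four access patterns above can be sketched on a small vector. The vector weights and its values are hypothetical.

```r
# Hypothetical example vector of weights in grams.
weights <- c(50, 60, 65, 82, 41)

weights[2]             # single value by position (R counts from 1): 60
weights[2:4]           # slice from position 2 to 4, inclusive: 60 65 82
weights[c(1, 5)]       # arbitrary positions: 50 41

# A logical vector keeps the elements where the condition is TRUE.
heavy <- weights > 55  # FALSE TRUE TRUE TRUE FALSE
weights[heavy]         # 60 65 82

# Conditions can be combined with & (and) and | (or).
weights[weights > 55 & weights < 80]  # 60 65
```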

Starting with Data
  • Use read_csv() to read tabular data in R.

  • Use factors to represent categorical data in R.
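
A minimal sketch of both points, assuming the readr package is installed. The file contents, column names and the name interviews are hypothetical; the CSV is written to a temporary file so the example is self-contained.

```r
library(readr)  # provides read_csv()

# Write a small hypothetical CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,village,rooms", "1,south,1", "2,north,3", "3,south,2"), csv_path)

# read_csv() returns a tibble; text columns are read as character, not factor.
interviews <- read_csv(csv_path)

# Convert the categorical column to a factor explicitly.
interviews$village <- factor(interviews$village)

levels(interviews$village)   # "north" "south" (sorted alphabetically by default)
nlevels(interviews$village)  # 2
```

Because read_csv() does not create factors on its own, the conversion is a deliberate step, which makes it easier to control the order of the levels.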

Introducing dplyr and tidyr
  • Use the dplyr package to manipulate dataframes.

  • Use select() to choose variables from a dataframe.

  • Use filter() to choose data based on values.

  • Use group_by() and summarize() to work with subsets of data.

  • Use mutate() to create new variables.

  • Use the tidyr package to change the layout of dataframes.

  • Use gather() to go from wide to long format.

  • Use spread() to go from long to wide format.
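
The verbs listed above can be sketched on a small data frame, assuming dplyr and tidyr are installed. The data and all object names are hypothetical. Note that gather() and spread() still work, but recent tidyr versions mark them as superseded by pivot_longer() and pivot_wider().

```r
library(dplyr)
library(tidyr)

# Hypothetical survey data: one row per household.
households <- data.frame(
  id      = 1:4,
  village = c("north", "north", "south", "south"),
  members = c(3, 7, 10, 4),
  rooms   = c(1, 2, 4, 2)
)

# select() picks columns, filter() picks rows, mutate() adds a new column.
crowding <- households %>%
  select(id, village, members, rooms) %>%
  filter(members > 3) %>%
  mutate(people_per_room = members / rooms)

# group_by() + summarize() computes one value per group.
village_means <- households %>%
  group_by(village) %>%
  summarize(mean_members = mean(members))

# gather() stacks the two measurement columns into key/value pairs (wide to long).
long <- households %>%
  gather(key = "measure", value = "count", members, rooms)

# spread() reverses the operation (long to wide).
wide <- long %>%
  spread(key = measure, value = count)
```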

Data visualisation with ggplot2
  • ggplot2 is a flexible and useful tool for creating plots in R.

  • The data set and coordinate system can be defined using the ggplot() function.

  • Additional layers, including geoms, are added using the + operator.

  • Boxplots are useful for visualizing the distribution of a continuous variable.

  • Bar plots are useful for visualizing categorical data.

  • Faceting allows you to generate multiple plots based on a categorical variable.
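
A short sketch of these ideas, assuming ggplot2 is installed. The data frame and its column names are hypothetical.

```r
library(ggplot2)

# Hypothetical data: one continuous and two categorical variables.
households <- data.frame(
  village   = rep(c("north", "south"), each = 4),
  wall_type = rep(c("brick", "mud"), times = 4),
  members   = c(3, 7, 5, 6, 10, 4, 8, 9)
)

# ggplot() sets the data and aesthetic mappings; layers are added with +.
box <- ggplot(households, aes(x = village, y = members)) +
  geom_boxplot()

# geom_bar() counts rows per category; facet_wrap() draws one panel per village.
bars <- ggplot(households, aes(x = wall_type)) +
  geom_bar() +
  facet_wrap(~ village)

print(box)
print(bars)
```

Storing each plot in an object before printing makes it possible to add further layers later with +.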

Glossary

FIXME