R Project Setup
|
|
Research Project Organisation
|
|
Introduction to Spreadsheets
|
|
Formatting data tables in Spreadsheets
|
Never modify your raw data. Always make a copy before making any changes.
Keep track of all of the steps you take to clean your data.
Organize your data according to tidy data principles.
Record metadata in a separate plain text file.
|
Formatting problems
|
Avoid using multiple tables within one spreadsheet.
Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).
Record zeros as zeros.
Use an appropriate null value to record missing data.
Don’t use formatting to convey information or to make your spreadsheet look pretty.
Place comments in a separate column.
Record units in column headers.
Include only one piece of information in a cell.
Avoid spaces, numbers and special characters in column headers.
Avoid special characters in your data.
|
Dates as data
|
|
Quality assurance
|
|
Exporting data
|
Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.
Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
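As a rough illustration of why a plain-text export helps, the sketch below writes a small table to CSV and reads it back in R. The data frame, its column names and the temporary file path are hypothetical and not part of the lesson data. The example assumes missing values are recorded as NA, so a true zero stays distinct from a missing measurement.

```r
# Hypothetical survey table: one real zero and one missing weight.
surveys <- data.frame(
  plot_id  = c(1, 2, 3),
  weight_g = c(40, NA, 0)
)

# Write to a temporary CSV file (the path is generated, not a real project file).
csv_path <- tempfile(fileext = ".csv")
write.csv(surveys, csv_path, row.names = FALSE, na = "NA")

# Read it back, telling R which string marks missing data.
reloaded <- read.csv(csv_path, na.strings = "NA")
```

After the round trip, the missing weight is still missing and the zero is still a zero, which is the consistency that exporting to CSV is meant to provide.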
|
Introduction to OpenRefine
|
OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine automatically tracks every step you take, allowing you to backtrack as needed and providing a record of all work done.
|
Working with OpenRefine
|
OpenRefine can import a variety of file types.
OpenRefine can be used to explore data using filters.
Clustering in OpenRefine can help to identify different values that might mean the same thing.
OpenRefine can transform the values of a column.
|
Filtering and Sorting with OpenRefine
|
|
Examining Numbers in OpenRefine
|
|
Using scripts
|
|
Exporting and Saving Data from OpenRefine
|
Cleaned data or entire projects can be exported from OpenRefine.
Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.
|
Other Resources in OpenRefine
|
|
Hello World - interacting with R
|
|
Starting with Coding
|
Access individual values by location using [].
Access slices of data using [low:high].
Access arbitrary sets of data using [c(...)].
Use logical operations and logical vectors to access subsets of data.
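The four access patterns above can be sketched with a small vector. The vector name and its values are made up for illustration.

```r
# Hypothetical vector of weights in grams.
weights <- c(50, 60, 65, 82)

first  <- weights[1]             # one value by position (R counts from 1)
middle <- weights[2:3]           # slice: positions 2 through 3, both included
picked <- weights[c(1, 4)]       # arbitrary set of positions
heavy  <- weights[weights > 60]  # logical vector keeps elements where TRUE
```

Note that, unlike in some other languages, both ends of low:high are included in the slice.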
|
Starting with Data
|
|
Introducing dplyr and tidyr
|
Use the dplyr package to manipulate dataframes.
Use select() to choose variables from a dataframe.
Use filter() to choose data based on values.
Use group_by() and summarize() to work with subsets of data.
Use mutate() to create new variables.
Use the tidyr package to change the layout of dataframes.
Use gather() to go from wide to long format.
Use spread() to go from long to wide format.
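A minimal sketch of these verbs, assuming the dplyr and tidyr packages are installed. The surveys data frame and its columns are hypothetical stand-ins, not the lesson's data set.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per plot and species.
surveys <- data.frame(
  plot_id  = c(1, 1, 2, 2, 3),
  species  = c("DM", "PF", "DM", "PF", "DM"),
  weight_g = c(40, 8, 44, 7, 42)
)

# filter() keeps rows by value, select() keeps columns, mutate() adds a column.
heavy <- surveys %>%
  filter(weight_g > 10) %>%
  select(plot_id, weight_g) %>%
  mutate(weight_kg = weight_g / 1000)

# group_by() + summarize() computes one summary row per group.
mean_by_species <- surveys %>%
  group_by(species) %>%
  summarize(mean_weight_g = mean(weight_g))

# spread(): long to wide, one column per species.
wide <- surveys %>% spread(key = species, value = weight_g)

# gather(): wide back to long, keeping plot_id as an identifier column.
long <- wide %>% gather(key = "species", value = "weight_g", -plot_id)
```

Plot 3 has no PF record, so the wide table holds NA in that cell and the gathered table has six rows rather than the original five. In recent tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), although they remain available.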
|
Data visualisation with ggplot2
|
ggplot2 is a flexible and useful tool for creating plots in R.
The data set and coordinate system can be defined using the ggplot function.
Additional layers, including geoms, are added using the + operator.
Boxplots are useful for visualizing the distribution of a continuous variable.
Bar plots are useful for visualizing categorical data.
Faceting allows you to generate multiple plots based on a categorical variable.
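The layering described above might look like the following sketch, assuming the ggplot2 package is installed. The data frame and its column names are invented for illustration.

```r
library(ggplot2)

# Hypothetical data: weights for two species, recorded for both sexes.
surveys <- data.frame(
  species  = rep(c("DM", "PF"), each = 4),
  sex      = rep(c("F", "M"), times = 4),
  weight_g = c(40, 44, 42, 45, 8, 7, 9, 8)
)

# ggplot() sets the data and aesthetic mappings; + adds a geom layer.
box <- ggplot(data = surveys, mapping = aes(x = species, y = weight_g)) +
  geom_boxplot()

# A bar plot counts the rows in each category of a categorical variable.
bars <- ggplot(data = surveys, mapping = aes(x = species)) +
  geom_bar()

# Faceting draws one panel per level of a categorical variable.
faceted <- box + facet_wrap(~ sex)
```

Each object is a plot specification. Printing it (for example, typing box at the console) renders the figure.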
|