This workshop will enable participants to:
• Produce data descriptions and summaries to understand the data
• Use statistical tools to clean and manipulate data
• Integrate relational data
• Identify and handle missing data
• Visualise data and explore patterns
• Improve their interdisciplinary team-working skills
This course, jointly organised by NCRM and the UK Data Service, will introduce participants to the complexities of analysing data from multiple sources. It will cover issues of data quality, cleaning, derivation and linkage.
The increasing availability of data on all aspects of modern life, whether open, archived or proprietary, has opened up the possibility of drawing on multiple datasets to solve analytical problems.
Getting to know the available data is a fundamental step in data analysis. Not only does it tell us what the data contain, and their scope and shape, but it also provides insights into quality, format and other potential issues that affect the usability of the data. This is especially important when working with data from different sources, where inconsistencies are more likely to occur and can cause problems when merging or linking the datasets.
The morning session will focus on data cleaning and manipulation as an essential part of data analysis. We will learn how to identify the type of cleaning a particular dataset needs before analysis, and cover techniques and practical tools for exploring and manipulating data, with an emphasis on checking data quality, removing unnecessary data, creating new variables and dealing with potential errors and inconsistencies.
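As an illustration of the cleaning steps listed above, here is a minimal sketch. The workshop practicals use R; this sketch uses plain Python purely for illustration, and the records, variable names and thresholds are all hypothetical assumptions rather than course material.

```python
# Hypothetical survey records; every name and value here is invented for illustration.
raw_records = [
    {"id": "1", "region": " North ", "height_cm": "170", "weight_kg": "65", "notes": "ok"},
    {"id": "2", "region": "north", "height_cm": "1.82", "weight_kg": "80", "notes": ""},
    {"id": "2", "region": "north", "height_cm": "1.82", "weight_kg": "80", "notes": ""},
    {"id": "3", "region": "SOUTH", "height_cm": "165", "weight_kg": "-1", "notes": "check"},
]


def clean(records):
    """Return cleaned copies of the records, leaving the input untouched."""
    cleaned = []
    seen_ids = set()
    for row in records:
        # Quality check: skip rows whose id has already been seen (duplicates).
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])

        # Inconsistency: harmonise spacing and capitalisation of category labels.
        region = row["region"].strip().lower()

        # Potential error: assume heights below 3 were entered in metres, not cm.
        height_cm = float(row["height_cm"])
        if height_cm < 3:
            height_cm *= 100

        # Potential error: treat non-positive weights as missing rather than guessing.
        weight_kg = float(row["weight_kg"])
        if weight_kg <= 0:
            weight_kg = None

        # New variable: body mass index, derived only when its inputs are valid.
        bmi = None
        if weight_kg is not None:
            bmi = round(weight_kg / (height_cm / 100) ** 2, 1)

        # Unnecessary data: the free-text "notes" column is not carried over.
        cleaned.append({
            "id": int(row["id"]),
            "region": region,
            "height_cm": height_cm,
            "weight_kg": weight_kg,
            "bmi": bmi,
        })
    return cleaned


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Each rule in the sketch is an explicit, documented decision. In practice, a rule such as the metres-to-centimetres conversion should be checked against the data documentation before it is applied.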
The afternoon session will first address missing data, with the goal of learning to identify missing data mechanisms and to choose methods for handling missingness that suit the underlying mechanism. We will then discuss the challenges of linking relational data and learn different methods for integrating data from different sources.
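The two afternoon topics can be sketched in a few lines. This is an illustrative sketch in plain Python (the practicals themselves use R); the datasets, column names and helper functions below are hypothetical. The first helper compares the share of missing values across an observed group: a clear difference suggests the data are not missing completely at random, although it cannot by itself establish the mechanism. The second performs a simple left join on a shared key and reports keys that found no match.

```python
# Hypothetical person-level records; None marks a missing value.
people = [
    {"person_id": 1, "area_code": "A", "age_group": "young", "income": 21000},
    {"person_id": 2, "area_code": "A", "age_group": "old", "income": None},
    {"person_id": 3, "area_code": "B", "age_group": "old", "income": None},
    {"person_id": 4, "area_code": "C", "age_group": "young", "income": 30000},
    {"person_id": 5, "area_code": "B", "age_group": "old", "income": 18000},
]

# Hypothetical lookup table from a second source; note that area "C" is absent.
areas = [
    {"area_code": "A", "area_name": "Northside"},
    {"area_code": "B", "area_name": "Riverside"},
]


def missing_rate_by(records, variable, group):
    """Share of records with a missing `variable`, within each level of `group`."""
    totals, missing = {}, {}
    for row in records:
        level = row[group]
        totals[level] = totals.get(level, 0) + 1
        if row[variable] is None:
            missing[level] = missing.get(level, 0) + 1
    return {level: missing.get(level, 0) / totals[level] for level in totals}


def left_join(left, right, key):
    """Keep every left row; add right-hand columns where the key matches."""
    lookup = {row[key]: row for row in right}
    right_columns = [col for col in right[0] if col != key]
    joined, unmatched = [], []
    for row in left:
        match = lookup.get(row[key])
        if match is None:
            # No match: keep the row, mark the linked columns as missing.
            unmatched.append(row[key])
            joined.append({**row, **{col: None for col in right_columns}})
        else:
            joined.append({**row, **match})
    return joined, unmatched


rates = missing_rate_by(people, "income", "age_group")
linked, unmatched_codes = left_join(people, areas, "area_code")

print("Missing income rate by age group:", rates)
print("Area codes with no match:", unmatched_codes)
```

Reporting unmatched keys, rather than silently dropping those rows, shows how linkage itself can introduce new missing values that then need handling.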
All sessions will include a mixture of presentations and hands-on practical activities. All practical exercises will be done using RStudio, giving participants the opportunity to apply the main concepts discussed in the lectures to real-world data.
Day 2 will focus on teamwork: participants will work in teams to produce an analysis that draws on multiple datasets. At the end of the day, each team will present its solution.
On completion of this workshop, participants will understand the challenges of using real-world data and be able to apply a range of tools to process, clean and transform data into a format suitable for analysis. Participants will also learn how to work with multiple datasets and apply practical methods for handling missing data.
Introduction to R webinar (optional)
The course will be taught using R. For those with no prior experience of R, the UK Data Service will run an introductory webinar on Thursday 5th September from 3:00 PM to 4:00 PM. All participants will be sent a private link to register for the webinar if they wish to attend.
Reading materials (not compulsory)
Wickham, H. and Grolemund, G. (2016) “R for Data Science”, available online: https://r4ds.had.co.nz/