Tentative schedule | Classroom 1 (Data literacy) | Classroom 2 (Data management and archiving) | Classroom 3 (Data documentation and reproducibility) | Classroom 4 (Data access, governance and ethics) |
Morning | Data Visualization with R’s ggplot Package (3 hours) | How to set up and configure a Dataverse repository that suits your needs, a hands-on session (2 hours) | Learn to Use IPUMS APIs (3 hours) | De-identification by Design: Creating Ethical Data Derivatives with Python (3 hours) |
Afternoon | Introduction to Network Analysis and Visualization Using Gephi (3 hours) | SSHOC Open Science and Research Data Management Train-the-Trainer Bootcamp (3 hours) | What is the DDI? An introduction to the Data Documentation Initiative metadata standard (3 hours) | Statistical Disclosure Control in a secure data environment: Training output checkers and analysts (3 hours) |
Abstracts
Title | Session | Abstract |
Data Visualization with R’s ggplot package | Class 1, Morning | R is free, open-source software for statistical analysis. It is used widely in the social sciences, among other fields. R’s ggplot package is built for making professional-looking graphs with relatively little effort. The package offers a powerful graphics language and is easy to learn.
This hands-on workshop is designed to introduce participants to the principles of data visualization using R’s ggplot package. The workshop will start with a short discussion of how to design an effective graph and choose the best visualization, given your message, audience and the type of data used. This will be followed by hands-on instruction designed to familiarize participants with the syntax used in R’s ggplot package to organize components of data and translate them into a graph. Data management and manipulation are integral parts of the data visualization process. We will start with a few basic R functions, preparing the data for graphing. Once the data are in the right format, we will focus our attention on the ggplot package. Working step by step through examples, we will come to understand the structure of the ggplot code and the connection between variables in the data and the colors, shapes and points that appear on the graph. Topics covered include plotting continuous and categorical variables, layering information on graphics, producing facets for compact presentation and comparison of information across multiple plots, and working with output from a statistical model. We will use ggplot functions to modify and refine our graphs. Template R code will be provided, allowing participants to reproduce all workshop examples. This is an introductory-level workshop. No previous experience is needed; however, some familiarity with data analysis and/or R may be helpful. |
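For readers unfamiliar with the grammar-of-graphics pattern this abstract describes, the sketch below shows the same idea of mapping variables to aesthetics, adding a geometry layer, and faceting. It uses Python's plotnine package (a port of ggplot2) purely for illustration; the workshop itself is taught in R, and the data frame here is a made-up example rather than workshop data.

```python
# A minimal grammar-of-graphics sketch in Python's plotnine (a ggplot2 port);
# the workshop uses R, and these data are illustrative only.
import pandas as pd
from plotnine import ggplot, aes, geom_point, facet_wrap, labs

# Hypothetical survey extract: one row per respondent.
df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61, 29, 44, 38],
    "income": [21, 34, 48, 55, 40, 27, 52, 45],   # in thousands
    "region": ["North", "South", "North", "South"] * 2,
})

# Map variables to aesthetics (x, y, colour), add a geometry layer,
# and facet by a categorical variable for side-by-side comparison.
plot = (
    ggplot(df, aes(x="age", y="income", color="region"))
    + geom_point()
    + facet_wrap("~region")
    + labs(x="Age", y="Income (thousands)")
)
plot.save("income_by_age.png")  # or display the object in a notebook
```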
How to set up and configure a Dataverse repository that suits… | Class 2, Morning | Since its creation, the Center for Socio-Political Data (CDSP) of Sciences Po has been committed to designing and developing services that support the research data lifecycle.
Pursuing its mission of data preservation and dissemination over the years, it has also been engaged in creating innovative information systems and tools for data collection. In 2016, the CDSP launched the first Dataverse repository in France, and it has built strong experience with Harvard IQSS’ open-source Dataverse solution. We propose a workshop session to share our expertise. During this workshop, we will guide you through the complete process of launching a test Dataverse repository instance and performing some useful adjustments and customisations that suit your project and environment. |
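As a taste of the kind of adjustment the session covers, the sketch below scripts one common customisation on a freshly launched test instance: creating a sub-collection through the Dataverse native API. The server URL, API token and collection details are placeholders, and payload fields can vary by Dataverse version, so treat this as an illustration and check the Dataverse API guides for the authoritative reference.

```python
# Hedged sketch: create a collection on a test Dataverse instance via the
# native API. SERVER_URL, API_TOKEN and all collection details are placeholders.
import requests

SERVER_URL = "http://localhost:8080"          # your test instance
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx"         # an admin user's API token

collection = {
    "name": "Demo Project Collection",
    "alias": "demo-project",
    "dataverseContacts": [{"contactEmail": "curator@example.org"}],
    "affiliation": "Example University",
    "description": "Test collection created during the workshop.",
    "dataverseType": "RESEARCH_PROJECTS",
}

resp = requests.post(
    f"{SERVER_URL}/api/dataverses/root",      # create under the root collection
    json=collection,
    headers={"X-Dataverse-key": API_TOKEN},
)
resp.raise_for_status()
print(resp.json()["data"]["alias"])           # alias of the new collection
```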
Learn to use IPUMS APIs | Class 3, Morning | IPUMS NHGIS (National Historical Geographic Information System) provides easy access to summary tables and time series of United States population, housing, agriculture, and economic data, along with GIS-compatible boundary files, for years from 1790 through the present and for all levels of U.S. census geography, including states, counties, tracts, and blocks. Until recently, access to these data was exclusively through a web-based graphical user interface. Newly available application programming interfaces (APIs) now enable users to access NHGIS data programmatically. By providing a structured extract definition format and programmatic access to NHGIS data, the APIs facilitate transparent documentation and reproducibility of users’ extract requests.
This workshop will introduce users to the NHGIS APIs with hands-on demonstrations and exercises. Participants will learn how to access metadata describing the NHGIS collection, including information about datasets, tables, time series, and shapefiles. We will then guide participants through the process of constructing data extract requests that can be submitted and retrieved via the API. We will explore both simple extract requests for individual tables and more complex requests involving time series, multiple datasets, and shapefiles. Upon completing the workshop, participants will be able to use the NHGIS API for common use cases, such as submitting a series of related extracts, setting up a common extract request to update periodically, and sharing a structured definition of an extract with colleagues. The workshop will be presented using R, though the API can also be accessed via other tools, including Python and curl. See developer.ipums.org for more information, including example code. |
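The workshop materials are in R, but the pattern of talking to the NHGIS API is similar from any language. The Python sketch below illustrates the two steps the abstract describes, browsing metadata and submitting a structured extract definition. The endpoint paths, version parameter, and the dataset and table codes are assumptions based on developer.ipums.org and should be checked against the current documentation.

```python
# Illustrative only: programmatic NHGIS access from Python. Endpoint paths,
# version parameters and dataset/table codes are assumptions; see
# developer.ipums.org for the authoritative API reference.
import requests

API_KEY = "YOUR_IPUMS_API_KEY"                 # issued with your IPUMS account
BASE = "https://api.ipums.org"
HEADERS = {"Authorization": API_KEY}

# 1) Browse metadata: list available NHGIS datasets.
datasets = requests.get(
    f"{BASE}/metadata/nhgis/datasets", params={"version": "v1"}, headers=HEADERS
).json()

# 2) Submit a structured extract definition for a single summary table.
extract_definition = {
    "datasets": {
        "1990_STF1": {"data_tables": ["NP1"], "geog_levels": ["county"]}
    },
    "data_format": "csv_no_header",
    "description": "Total population by county, 1990 (workshop-style example)",
}
submitted = requests.post(
    f"{BASE}/extracts", params={"product": "nhgis", "version": "v1"},
    json=extract_definition, headers=HEADERS,
).json()
print(submitted.get("number"))                 # extract number to poll for completion
```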
De-identification by design: Creating Ethical Data Derivatives with Python | Class 4, Morning | Research and proprietary data often contain personally identifiable information, with variables that reveal details about the lives of individuals and may have been collected without the person’s knowledge or consent. Datasets aggregated at the individual level often interest social science scholars, yet such data pose a risk of identification and create an ethical dilemma for curators.
While some types of information and data are legally protected, other social data, such as home mortgage files, voter registration files, and tax parcel records, are public and are often augmented with modeled indicators, such as religious belief or personal income, that may not represent the reality of people’s lives. Library information and data specialists must develop infrastructure, workflows, and policies to ensure the ethical stewardship and use of these datasets. This interactive workshop will explore the tension between making purchased data as widely accessible to researchers as possible and ensuring that sensitive data are not abused. Following a short discussion of some of the above challenges, we will introduce participants to technologies and workflows for data de-identification. Covering basic principles of data management, the workshop comprises hands-on activities in which participants will create redacted samples of data that maintain research integrity and usefulness. Learning outcomes include: 1) developing fluency with generating random samples in order to make analysis of large files more manageable; 2) knowing how to assess the identification risk of specific variables within a dataset in order to protect the identity of human subjects; 3) creating a Jupyter Notebook workflow that enables cleaning, redacting, and sharing data for research use; 4) learning some fundamental Pandas features for exploring, cleaning, and transforming data. |
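To make the learning outcomes more concrete, here is a minimal Pandas sketch of the sample-assess-redact workflow the abstract outlines. The file name, column names and the group-size threshold are hypothetical, and in practice the steps would live in a documented Jupyter Notebook.

```python
# A minimal sketch of a de-identification workflow with Pandas. The file name,
# column names and the threshold of 5 are hypothetical, for illustration only.
import pandas as pd

df = pd.read_csv("mortgage_records.csv")          # hypothetical source file

# 1) Draw a random sample to make a large file manageable (and reduce linkage risk).
sample = df.sample(frac=0.10, random_state=42)

# 2) Assess identification risk: how many records share each combination of
#    quasi-identifiers? Small groups are the riskiest.
quasi_ids = ["zip_code", "birth_year", "gender"]
group_sizes = sample.groupby(quasi_ids).size()
print("quasi-identifier groups with fewer than 5 records:", int((group_sizes < 5).sum()))

# 3) Redact direct identifiers and coarsen quasi-identifiers.
redacted = (
    sample.drop(columns=["name", "street_address", "parcel_id"])
          .assign(zip_code=lambda d: d["zip_code"].astype(str).str[:3],   # 3-digit ZIP
                  birth_year=lambda d: (d["birth_year"] // 10) * 10)      # decade only
)

# 4) Share the derivative, not the source.
redacted.to_csv("mortgage_records_deidentified.csv", index=False)
```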
Intro to Network Analysis and Visualization using Gephi | Class 1, Afternoon | A network is a way of specifying relationships among a collection of entities or actors. Networks come up in a variety of situations; for example, they can describe relationships between characters in literary works, how authors cite each other in a particular discipline or how people interact on social media. Through a combination of lecture and activities, this three-hour workshop will provide an introduction to network analysis and visualization using a free, open-source tool called Gephi: https://gephi.org/.
After taking this workshop, participants will be able to: • Recognize networks and situations that call for network visualization and analysis • Use appropriate terms and statistics to describe networks • Understand network data formats and format data for use in Gephi • Use Gephi to load, visualize, analyze, and publish network graphs. We will also provide recommendations for other network visualization and analysis tools, and sources of network data. This workshop is aimed at those new to networks and Gephi. |
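Gephi itself is a point-and-click tool, but the data-preparation step mentioned in the learning outcomes is easy to script. The sketch below, using Python's networkx with a made-up edge list, computes a couple of the descriptive statistics the workshop covers and writes a GEXF file that Gephi can open; the names and relationships are purely illustrative.

```python
# Illustrative only: prepare a tiny network for Gephi. Build an edge list,
# compute basic descriptive statistics, and export to GEXF.
import networkx as nx

edges = [
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),
    ("Carol", "Dan"), ("Dan", "Eve"),
]
G = nx.Graph(edges)

# A few of the descriptive statistics used to characterise networks.
print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("degrees:", dict(G.degree()))
print("density:", nx.density(G))

# Export for Gephi (File > Open); a simple CSV edge list also works.
nx.write_gexf(G, "toy_network.gexf")
```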
SSHOC Open Science and Research Data Management train-the-trainer | Class 2, Afternoon | SHORT SUMMARY
The Social Sciences and Humanities Open Cloud (SSHOC) project’s first Open Science and Research Data Management Train-the-Trainer Bootcamp targets OS and RDM professionals and aims to help them build and/or improve their training and education programmes for local researchers. The bootcamp will be structured as an ongoing interactive exercise during which participants will be trained to prepare, implement and evaluate their own training programmes. Participants will discuss and apply training methodologies and styles, exchange experiences, and provide tips to improve training skills. Topics covered include the development of learning objectives, the use and design of interactive activities, adaptation of training methodologies to different learning styles, practical workshop preparation tips, and evaluation techniques. At the beginning of the bootcamp, the trainers will provide an overview of available SSH training materials (toolkit), with a particular focus on RDM and OS. These include materials on, for instance, GDPR, sensitive data, and FAIR and open data. Throughout the bootcamp, participants are encouraged to use the existing materials, assess their usefulness for their own training, and identify opportunities and gaps.

TRAINERS

Ricarda Braukmann, PhD; Program Leader Social Sciences, DANS – Netherlands Institute for Permanent Access to Digital Research Resources. Dr. Ricarda Braukmann works as program leader social sciences at Data Archiving and Networked Services (DANS). DANS is a national data center in the Netherlands promoting sustainable data storage, Research Data Management (RDM) and Open Science. Ricarda is part of the policy and communication department at DANS and is involved in several (international) projects, such as FREYA and SSHOC, where she works on engagement and training activities. In the past, Ricarda was involved in the development of the CESSDA Data Management Expert Guide (cessda.eu/DMEG), an online RDM training for researchers, and its train-the-trainer toolkit.

Tanya Yankelevich, Training Coordinator, LIBER Europe – Association of European Research Libraries. Tanya is a Training Coordinator for the Association of European Research Libraries (LIBER) and, as such, is responsible for facilitating training activities within LIBER’s projects and supporting the training needs of LIBER and its network. With her extensive experience in education and strategy across a wide range of subjects in educational institutions, non-governmental and international organisations, Tanya appreciates the value of tailoring each training session to its specific audience and puts extra effort into ensuring that quality and fun are top priorities of every workshop. Her commitment to full and equal access to quality education for all, as well as her passion for collaboration and innovation, drives her to try new approaches and inclusive education models that best fit training needs. |
What is the DDI? An intro… | Class 3, Afternoon | Are you interested in learning about the international metadata standard DDI and how it can be used to benefit the work of your organization?
This workshop provides an introductory overview of the work products of the DDI Alliance and gives practical examples of how DDI can be used beneficially by organizations and institutions that manage research data. The overall approach of the course is agnostic to DDI versions, although some of the examples shown will relate to specific versions of DDI (DDI-Codebook, DDI-Lifecycle, or the DDI4 Core to be released in 2020). The main focus of the course will be on the following areas:
|
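For readers who have never seen DDI metadata, the sketch below builds a tiny DDI-Codebook-style fragment in Python: a study title plus one documented variable with its category labels. The element names follow common DDI-Codebook (2.x) usage, but the fragment is illustrative only and omits required attributes and namespaces; consult the DDI Alliance documentation for the actual schema.

```python
# Illustrative only: a skeletal DDI-Codebook-style record built with the
# standard library. Element names (codeBook, stdyDscr, dataDscr, var, labl,
# catgry, catValu) follow DDI-Codebook usage; namespaces and required
# attributes are omitted for brevity.
import xml.etree.ElementTree as ET

codebook = ET.Element("codeBook")

# Study-level description: just a title here.
stdy = ET.SubElement(codebook, "stdyDscr")
titl = ET.SubElement(ET.SubElement(ET.SubElement(stdy, "citation"), "titlStmt"), "titl")
titl.text = "Example Social Survey 2019"

# Variable-level description: one variable and its categories.
data_dscr = ET.SubElement(codebook, "dataDscr")
var = ET.SubElement(data_dscr, "var", name="EMPSTAT")
ET.SubElement(var, "labl").text = "Employment status"
for value, label in [("1", "Employed"), ("2", "Unemployed"), ("3", "Not in labour force")]:
    cat = ET.SubElement(var, "catgry")
    ET.SubElement(cat, "catValu").text = value
    ET.SubElement(cat, "labl").text = label

ET.ElementTree(codebook).write("example_ddi_codebook.xml", encoding="utf-8")
print(ET.tostring(codebook, encoding="unicode"))
```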
Statistical Disclosure Control in a secure data environment | Class 4, Afternoon | Analysts are demanding access to more data about individuals and organisations than ever before. Such data are available in the UK; however, due to the level of detail, they are typically accessed by analysts in a secure data environment. The statistical results generated by analysts in secure data environments are made available only after they undergo a Statistical Disclosure Control (SDC) review, to ensure that the results do not reveal the identity of, or contain any confidential information about, a data subject. This SDC review is carried out by the analysts who have produced the results and/or the staff working in the secure data environment (output checkers). An important responsibility for the organisation hosting the secure data environment is therefore to ensure that analysts and output-checking staff are appropriately trained and skilled in applying SDC.
This workshop is designed for managers running a secure data environment. Drawing on the SDC training and resources we have developed in the UK, the workshop will teach participants how to train analysts to apply SDC and how to make SDC work in practice, including managing the SDC process and workload efficiently. Participants will also learn how to devise SDC materials for analysts and output checkers, and how to train and assess output-checking staff. We will use a range of hands-on exercises to explore these topics. Following the workshop, we will also provide participants with access to a set of SDC materials that they can use to help train their staff and analysts. |
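As a concrete illustration of what output checking involves, the sketch below applies a simple frequency-threshold rule to a cross-tabulation: any cell based on fewer than a chosen number of observations is flagged for suppression before release. The threshold of 10 and the data are hypothetical; real secure environments define their own rules and combine several checks.

```python
# Hedged sketch of one common SDC output check: flagging low-frequency cells
# in a cross-tabulation. The threshold and data are illustrative, not a real rule set.
import pandas as pd

# Hypothetical microdata held inside the secure environment.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South", "East", "East", "East"],
    "employed": ["yes", "no", "yes", "yes", "no", "yes", "yes", "yes"],
})

THRESHOLD = 10  # minimum observations per published cell (environment-specific)

# The table an analyst wants to release.
table = pd.crosstab(df["region"], df["employed"])

# Output check: suppress any cell below the threshold before release.
unsafe = table < THRESHOLD
safe_table = table.astype(object).mask(unsafe, "<suppressed>")

print("cells flagged for suppression:", int(unsafe.values.sum()))
print(safe_table)
```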