The Psychology Research Experience Program (PREP) provides mentoring and experience to undergraduates interested in a career in scientific psychology. LUCID partnered with PREP to create a hands-on data science workshop series, facilitated by LUCID graduate students.

**This virtual data science workshop will be held on Wednesdays from 4–5pm starting June 16th, 2021. Look for the meeting link in your email.**

### Overview:

In the workshops, students will be introduced to data science environments, concepts, and applications through the use of Jupyter notebooks. LUCID facilitators will introduce a series of data science concepts via online materials and hands-on virtual sessions, and students will work through examples and demos in the notebook environment with guidance from LUCID graduate students.

### Goals:

For PREP students to gain a sense of 1) how to work with an R or Python integrated development environment, 2) what one can do with a range of data science tools, and 3) how to continue learning about and working with these tools in the future. Note that the goal is not to teach programming in R, Python, or any other language, but to show how to work interactively with and adapt notebooks that carry out common data science tasks, and to give a general sense of what the methods are used for and how they might be applied to one’s own data.

### Schedule:

TBD

### Materials and Session Outlines:

This will be updated with materials and facilitator outlines as they become available.

### Resources and Sessions from 2020:

This is an accordion element with a series of buttons that open and close related content panels.

## Support Vector Machines

**Session 1: Support Vector Machines (SVM) with Kushin Mukherjee**

Support Vector Machines (SVMs) address a fundamentally simple problem: how do we divide up data points with a meaningful decision boundary in a supervised learning setting? The approach gets its name from the support vectors, the subset of labeled data points whose dot products help determine the decision boundary.

In contrast to approaches like simple neural networks or least-squares classifiers, SVMs have two advantages that are important to consider together:

- They do not get stuck in local minima. If the data are linearly separable, the algorithm will always find the same ‘best’ decision boundary.
- If the data aren’t linearly separable, the SVM can compute the dot products in a transformed space where the data are linearly separable. This is what’s known as the ‘kernel trick’.
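The prep materials below use scikit-learn; purely as a minimal, hedged sketch of both points, the snippet below compares a linear-kernel and an RBF-kernel SVM on synthetic data that are not linearly separable (concentric rings, not any workshop dataset):

```python
# Linear vs. RBF-kernel SVM on concentric rings (not linearly separable).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
# The kernel trick: compute dot products in a transformed space
# where the rings become linearly separable.
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
```

The linear kernel typically stalls near chance here, while the RBF kernel separates the rings almost perfectly.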

(Note: While I do distinguish the SVM approach from simple neural networks, it has been shown that there are *specific* classes of neural networks that are equivalent to kernel methods such as those in SVMs. Here’s a brief summary – What are the Mathematical Relationship between Kernel Methods and Neural Networks.)

List of ideas/concepts/tools that are associated with this topic

- Classification
- Supervised learning
- Linear separability
- Kernel methods

**Preparation for meeting:**

**First Watch:** Patrick Winston’s lecture on SVMs is one of the easiest to follow and assumes a very minimal background in linear algebra and multivariable calculus: Youtube

**Try this out second!** You will need Jupyter and the necessary libraries installed. A Python-based implementation of SVM using scikit-learn: Stackabuse

**Additional Optional Resources:**

**Videos:**

One might like Andrew Ng’s 2018 lecture on the same topic, which is a bit more recent, but SVMs haven’t changed much over the past decade: Youtube (start from 46:20)

**Online tutorials:**

To get a stronger grasp of the mathematics behind SVMs and do some hands-on work with them, I recommend this site: SVM Tutorial

Here’s another Jupyter-notebook-based Python implementation of SVMs using scikit-learn: Learnopencv

**Applied Papers:**

The following is useful for seeing how these tools are used in cognitive science more broadly.

Here are 2 papers that employ SVMs in NLP and cognitive neuroscience settings

Shallow semantic parsing of sentences using SVMs: aclweb

Effective functional mapping of fMRI data using SVMs: ncbi

**Theory Papers:**

The original SVM paper by Vladimir Vapnik: image.diku

## Jupyter Notebooks Tutorial

Jupyter Notebooks Online Tutorial with Pablo Caceres

The following is a great resource to watch/read at your own pace, and feel free to contact Pablo with any questions.

## Unix Shell Tutorial

## R Markdown

**Introduction to R Markdown with Gaylen Fronk**

R Markdown provides an authoring framework for data science in R. With a single R Markdown file, you can not only write, save, and execute your code but also communicate your process and results with an audience using high-quality, reproducible output formats.

**More detail about R Markdown**

R Markdown builds off tools already available in R and RStudio to provide an integrated environment for processing, coding, and communicating. An R Markdown file can include text, chunks of code, images, links, figures, and tables. While you’re working in your RStudio environment, your file operates similarly to a normal R script (a .R file) – you can write, edit, and evaluate code to work with your data. At any point, you can “knit” your file. Knitting runs, evaluates, and compiles your R Markdown file into your desired output (e.g., HTML, PDF) to create a single document that includes all the components of your written file *plus* the results. This knit file is ready for high-quality scientific communication with any audience. If you’ve ever seen nice examples of R code and output online, it was probably made using R Markdown.

**Why should I use R Markdown?**

R Markdown is particularly helpful if…

- You already work in R or RStudio and would like some additional tools at your disposal
- You value reproducible output
- You would like to be able to share your work with people who are less familiar with R (or coding more generally)

R Markdown combines the data wrangling and analytic tools of R with high-class scientific communication. It can become your one-stop-shop for sharing your data science.

**Prepare for the LUCID/PREP Data Science Workshop on R Markdown:**

In preparation for our video meeting next week (Wednesday 7/1 at 4pm CST), please watch, read, or review the following materials.

- Begin with this 1-minute video of what’s possible with R Markdown.
- Read Chapter 1 (Installation) from R Markdown: the Definitive Guide (**Note:** you should have R & RStudio installed prior to our workshop; confirm in advance that you can open these applications.)
- Read Chapter 2 (Basics) from R Markdown: the Definitive Guide
- Read this section of Chapter 3 (Outputs: HTML) from R Markdown: The Definitive Guide
- Review this cheat sheet and have it handy for our meeting

Optional additional resources if you’re interested in learning more:

- This paper from the Statistics area of arXiv.org discusses how R Markdown can improve a data science communication workflow. It’s perfect for people interested in understanding why R Markdown may be beneficial and seeing examples of its use cases.
- This online book contains lessons on R Markdown basics, specific output formats, in-line and chunk code, tables, interactive websites, presentations, using multiple coding languages, and more. It’s perfect for someone looking for a comprehensive (yet still quite succinct) tutorial on using R Markdown.
- The Communication section of the R for Data Science online book includes several chapters on R Markdown (the tidyverse’s preferred method for statistical and scientific communication).
- This online code from GitHub Gist provides an example/walkthrough of using R Markdown.

A note from Gaylen:

If you have questions about these materials or other questions you’d like answered during our workshop, you can submit them via this form. Please try to do this by **Tuesday 6/30 at 5pm CST** so that I can aggregate questions in advance.

Workshop will be led by **Gaylen Fronk**. You can email me at gfronk@wisc.edu if you have problems accessing these materials or installing R/RStudio. Looking forward to meeting you all!

## Regression using Jupyter Notebooks

#### Optimization and model regularization with Owen Levin

Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems. We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.

**Overview:**

- linear regression as an optimization problem
- introduce loss functions

- curve fitting as optimization
- Is a perfect fit actually perfect? (wacky zero loss examples)
- model regularization
- small weights
- sparsity
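As a hedged illustration of the last two bullets (not the session’s own notebook), the sketch below uses scikit-learn on synthetic data: an L2 penalty (ridge) shrinks weights toward zero, while an L1 penalty (lasso) drives many of them exactly to zero.

```python
# Ridge (L2) -> small weights; Lasso (L1) -> sparse weights.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # penalizes the sum of squared weights
lasso = Lasso(alpha=0.1).fit(X, y)   # penalizes the sum of absolute weights

print("OLS weights:  ", np.round(ols.coef_, 2))
print("ridge weights:", np.round(ridge.coef_, 2))
print("lasso zeroed", int(np.sum(lasso.coef_ == 0)), "of 10 coefficients")
```

Comparing the printed weights shows ridge’s uniformly smaller coefficients and lasso’s exact zeros on the irrelevant features.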

**Preparation:**

1. If you haven’t already downloaded Anaconda or another Python distribution, please do so.

2. View this video: Owen’s Regression Intro

3. Jupyter Notebook: Optimization & Regularization

Or check out the github repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop

## Mixed Linear Models

**Introduction to Mixed Linear Models with Melissa Schoenlein**

Mixed linear models are a type of analysis used to evaluate data with non-independence that cannot otherwise be analyzed with regular linear regression.

__What is non-independence/non-independent data?__

Non-independence occurs when two or more data points are connected (correlated) in some way. For example, suppose you run an experiment collecting ratings of interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and again at the end of the semester, so each participant has three data points. These data points are non-independent because they come from the same person and thus are related in ways beyond the experimental procedure (i.e., data points from one participant are likely to be more similar to each other than data points from two different participants).

Non-independence extends beyond repeated measures at the participant level to any items occurring within “units,” such as students in classrooms, family members, etc.

__Why/when should I use mixed linear models?__

Using regular linear regression when data are non-independent can lead to inflated Type 1 error rates, less statistical power, and potentially inaccurate effects. A mixed linear model should be used any time there is a non-independent relationship in the data.
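The workshop itself uses the lme4 package in R; purely for illustration, the same kind of model can be sketched in Python with statsmodels (a library swap, not the session’s code). The data below are synthetic: 20 participants each rated three times, with a per-participant random intercept creating the non-independence.

```python
# Random-intercept mixed model on synthetic repeated-measures ratings.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, n_obs = 20, 3
subject = np.repeat(np.arange(n_subj), n_obs)
time = np.tile(np.arange(n_obs), n_subj)          # 0 = start, 1 = mid, 2 = end
subj_effect = rng.normal(scale=1.0, size=n_subj)  # the source of non-independence
rating = (5.0 + 0.5 * time + subj_effect[subject]
          + rng.normal(scale=0.3, size=n_subj * n_obs))

df = pd.DataFrame({"subject": subject, "time": time, "rating": rating})

# Fixed effect of time, random intercept per subject
# (cf. lme4's rating ~ time + (1 | subject)).
model = smf.mixedlm("rating ~ time", df, groups=df["subject"]).fit()
print(model.summary())
```

The summary separates the fixed effect of time from the between-participant variance, which ordinary regression would lump into the residual.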

__List of ideas/concepts/tools that are associated with this topic__

- Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.
- Non-independence
- Fixed versus random effects
- The lme4 package in R

**Preparation:**

In preparation for our video meeting Wednesday 7/15 at 4pm CST, please watch and read the following materials.

- Watch videos 1-3, 11, and 16 from this multi-part video series providing a general overview of mixed models, when to use them, and how to interpret them (~12 minutes total). Video 11 focuses on repeated measures models, which will be the focus of our workshop.
- Skim this online tutorial, which provides a walkthrough of the code and output of basic linear mixed effects models in R and why we use them.
- Skim this very short cheat sheet on using the lme4 package in R to analyze mixed models.
- Install the following packages in R: lme4, ggplot2

**Optional additional resources if you’re interested in learning more:**

__Videos:__

A high-level video overview of mixed models (mostly framed in terms of hierarchical models). The first half of the video describes when and why someone would use these models; the second half touches on the equations/math behind them.

__Tutorials:__

A GitHub repo with a 3-part workshop of tutorials and exercises for learning mixed models in R. The first part is a general intro to R, the second covers statistical modeling in R generally, and the third covers mixed models in R.

A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.

__Papers:__

This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.

Jake Westfall, a former quantitative psychologist who now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.


Workshop will be led by Melissa Schoenlein. I can be reached at schoenlein@wisc.edu if there are any issues accessing these materials or if there are any questions (about the workshop, the PREP program, the department, or anything!). Looking forward to meeting this year’s PREPsters!

## Data Visualization with Python in Jupyter Notebooks

**Data Visualization with Python in Jupyter Notebooks with Pablo Caceres**

**Introduction**

This session introduces data visualization in Python with a **clear mental model** based on a set of **graphical primitives** and carefully designed **combinatorial rules** that yield an ample space of graphical displays, avoiding the constraints of chart taxonomies.
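As a minimal, hedged notebook sketch of composing graphical primitives (lines, points, axis labels, a legend) into a single display, here is a matplotlib example; the session’s own materials may use a different library.

```python
# Compose line and point primitives into one labeled figure.
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")          # line primitive
ax.scatter(x[::5], y[::5], color="k")  # point primitive
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("demo.png")
```

Each call adds one primitive to the axes; combining them freely is what sidesteps a fixed taxonomy of chart types.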

**Preparation**

**Optional resources**

## Cross Validation

**Cross Validation with Sarah Sant’Ana**

Cross validation is a common resampling technique used in machine learning studies. Broadly, cross validation involves splitting data into multiple training and testing subsets to increase generalizability of the model building and evaluation processes. There are multiple types of cross validation (e.g. k-fold, bootstrapped), but all serve two primary purposes:

- To select the best model configurations (e.g. what type of statistical model will perform best, which sets of features will perform best, covariate selection, hyperparameter tuning, outlier identification approaches, predictor transformations, and more).
- To evaluate the expected performance of our models in new data (i.e. on individuals who were never used in model building/selection)
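Both purposes show up in a minimal scikit-learn sketch (a bundled demo dataset, not the session’s materials): k-fold CV splits the data so every observation lands in a held-out fold exactly once, and the fold scores estimate performance on new data.

```python
# 5-fold cross-validation of a logistic regression classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Five train/test splits -> five held-out accuracy estimates.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean CV accuracy:", scores.mean().round(3))
```

Comparing mean CV accuracy across candidate configurations is the selection step; the held-out scores themselves are the evaluation step.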

**Why should I use cross validation?**

You should use cross validation if…

- You are fitting a statistical model with hyperparameters that need tuning (e.g. elastic-net logistic regression, random forests, SVMs)
- You are considering multiple combinations of model configurations (e.g. features, statistical algorithms, data transformations)
- You want to consider a large number of predictive features or you do not want to rely on theory to guide identification of predictive features
- You want to build predictive models that will generalize well to new data (i.e. you want your model to be applied in some way)

**List of ideas, concepts, or tools that are associated with this topic**

- R/RStudio (especially the caret, tidymodels, and parsnip packages)
- Python
- Common types of cross validation (CV): bootstrapped CV, k-fold CV, nested CV
- Basic knowledge of linear and logistic regression
- Bias/variance trade offs in model fitting and evaluation
- Generalizability of predictive models (why it’s important, how to prioritize it, and how to assess it)

**In preparation for our meeting next Tuesday, please review the following materials:**

- For framing, please read the beginning of Yarkoni & Westfall (2017) http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!
- Watch this solid 6 minute explanation of cross validation by Statquest https://www.youtube.com/watch?v=fSytzGwwBVw
- Skim this “big picture” blog post that clarifies the distinction between model evaluation and model selection: A “short” introduction to model selection

**During the meeting on Tuesday**

- Plan on a discussion about prediction vs explanation in psychological research. I want to help you think of how you might apply cross validation in your work if you are interested 😊
- I will be walking us through the attached Cross Validation Markdown document (open this link, then download it; Google will default to opening it as a Google Doc, which is not functional) to provide you some code for implementing cross validation. No need to read this beforehand, but you can have it open during the session if you’d like to follow along.
- Feel free to send me any questions beforehand or ask during the session! Happy to talk research, data science, or grad school as would feel beneficial to you all. My email is skittleson@wisc.edu

**Additional Materials (not required, just for your reference)**

**Books**

- Here are selected readings on cross validation from two *free* online textbooks: James et al. (2013) Chapter 5: Resampling Methods (pp 175 – 186) and Kuhn and Johnson (2018) Chapter 4: Resampling Techniques (pp 67 – 78). These books are amazing for learning about any sort of applied statistical learning – highly recommend!

**Online tutorials (blogs and code examples):**

- This is an R Markdown file written by the creator of the caret package in R (one of the most used machine learning packages in R to date). It explains how to tune the various types of hyperparameters using CV within caret’s train function. Even if you don’t plan to use R, it is helpful to see what types of parameters are tuned for different models, and it provides examples of creating and evaluating search grids, alternate performance metrics, and more: Model training and tuning
- This is a nice (but lengthy) R Markdown example of approaching a classic machine learning problem (product price estimation) and showcases hyperparameter tuning of a couple of different algorithms (and their comparison): Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial. This is geared towards a more advanced beginner – It still walks you through everything, but incorporates more robust data cleaning and exploration before model fitting.

**Videos:**

- This video is a good walkthrough using K-fold cross-validation in python to select optimal tuning parameters, choose between models, and select features: Selecting the best model in scikit-learn using cross-validation
- A short 4 minute tutorial about how to tune various types of statistical learning models within cross validation using the caret package in R. It doesn’t discuss much of the theory and is more appropriate for application focused users who are just trying to figure out how to implement parameter tuning within CV: R Tutorial – Hyperparameter tuning in caret

**Papers:**

- This paper describes the impact of using different CV types for parameter selection and model evaluation: Bias in error estimation when using cross-validation for model selection. It requires an intermediate-level understanding of using CV for parameter selection. Many people using machine learning in applied contexts use improper CV methods that bias their model performance estimates; we should use nested CV (or bootstrapped CV with a separate validation set) when we plan to select model parameters and generate trustworthy performance metrics.
- A really cool preprint that describes sources of bias in ML resampling methods due to incorrect application in psychological research: https://psyarxiv.com/2yber/. A more intermediate-level read because it requires some understanding of multiple types of CV methods.

## Neural Networks

**Neural Networks with Ray Doudlah**

Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, a computational model inspired by the brain.

**Session outline:**

- Introduce neural networks and their general architecture
- Introduce convolutional neural networks
- Implement a convolutional neural network to solve a handwriting recognition task

**Preparation for the workshop:**

- Watch the following videos:
- Read these articles
- Pull code from GitHub

## Overleaf

**Overleaf by Glenn Palmer**

LaTeX is a typesetting system that can be used to write academic papers and create professional-looking documents. Users type in plain text format, but mark up the text with tagging conventions, and the nicely-formatted result is shown in an output file. Overleaf is an online platform that can be used to create and edit LaTeX documents. You can share and simultaneously edit documents with collaborators, similar to the way you collaborate on a Google Doc.

For a high-level overview of LaTeX, Overleaf, and the resources below, watch this video:


### Videos

- This playlist of videos is a good starting place. They were made by a company called ShareLaTeX, which recently merged with Overleaf. These videos give a good idea of how to get started using LaTeX with an online editing system.

### Online tutorials

- For more detail, and/or for a range of written tutorials, the Overleaf documentation page has a wide range of information to help get started, or to answer specific questions you might have as you get used to using LaTeX.

### Cheat sheet

- For a quick reference as you’re writing, this cheat sheet includes a bunch of commands for various formatting options, with a focus on writing scientific papers.

### Resources and Sessions from 2019:

This is an accordion element with a series of buttons that open and close related content panels.

## Introduction to Data Science with R

**Session 1: Introduction to data science with R with Tim Rogers**

This session will introduce you to working with data in an “integrated development environment” (IDE) using the freely available and widely used software package R. We will briefly discuss what is meant by the term “data science,” why data science is increasingly important in Psychology and Neuroscience, and how it differs from traditional statistical analysis. We will then get a sense for how IDEs work by building, from data generated in the workshop, an interactive graph showing the structure of your mental semantic network.

**Preparation for the workshop: (TO DO before arriving on Tuesday!)**

– Install R, RStudio, and Swirl on your laptop following the instructions here: swirlstats.com/students

– Start Swirl as instructed at the website and install the first course module by following the prompts

– Run yourself through the first course module

Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!

**Overview:**

We learned how to create semantic clusters from lists of animals. Tim created this Semantic Network Demo to view the interactive graph and get the code that was used to generate the semantic clusters. The demo walks through the process of building and visualizing graphs.
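The demo itself is in R; as a hedged Python sketch of the same idea, the snippet below builds a graph from made-up animal recall lists with networkx, linking adjacent items and reading clusters off as connected components.

```python
# Build a toy semantic network from recall lists and find its clusters.
import networkx as nx

# Hypothetical recall lists from three participants.
lists = [
    ["dog", "cat", "hamster"],
    ["shark", "whale", "dolphin"],
    ["cat", "hamster", "dog"],
]

G = nx.Graph()
for animals in lists:
    # Assume adjacent items in a recall list are semantically related.
    G.add_edges_from(zip(animals, animals[1:]))

clusters = [sorted(c) for c in nx.connected_components(G)]
print("clusters:", clusters)  # pets vs. sea animals
```

With these toy lists, the pets and the sea animals fall into two separate components, mirroring the semantic clusters from the session.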

## Using GitHub & Jupyter Notebooks

**Session 2: Using GitHub and Jupyter notebooks in several data science environments with Pablo Caceres**

**Preparation for the workshop:**

**Overview:**

## Fitting & Evaluating Linear Models

#### Session 3: Fitting and evaluating linear models with John Binzak

This session will introduce you to working with linear regression models using R. We will briefly discuss why linear regression is useful for Psychology and Educational research, using the topic of numerical cognition as an example. We will play an educational game to generate our own data in the workshop, form predictions, and test those predictions by modeling gameplay performance. Through this exercise we will cover how to fit linear regression models, assess the fit of those models, plot linear relationships, and draw statistical inferences.

**Preparation for the workshop: **

– Be ready to use R, RStudio, and Swirl on your laptop following the instructions here: swirlstats.com/students

– Install the “Regression Models” swirl module using the following commands in R:

> library(swirl)

> swirl::install_course("Regression Models")

> swirl()

– Run yourself through lessons 1-6 (Introduction-MultiVar Examples) and continue based on your interest.

Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!

## Optimization & Model Regularization

#### Session 4: Optimization and model regularization with Owen Levin

Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems. We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.

- linear regression as an optimization problem
- introduce loss functions

- curve fitting as optimization
- Is a perfect fit actually perfect? (wacky zero loss examples)
- model regularization
- small weights
- sparsity

**Preparation:** If you haven’t already downloaded anaconda or another python distribution please do so.

**Overview:** Please check out the github repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop

## Pattern Recognition & Varieties of Machine Learning

#### Session 5: Pattern recognition and varieties of machine learning with Ashley Hou

Owen and Ashley will be co-facilitating this session.

This session will introduce basic concepts in machine learning. We will first discuss an overview of the steps involved in the machine learning process and the two main categories of machine learning problems. Then, we will walk through examples in both supervised and unsupervised learning, specifically classification using SVMs (discussing the regularization perspective) and clustering using the k-means clustering algorithm. We will conclude with brief discussion on other popular machine learning algorithms, when to use them, and good resources to learn more.
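As a minimal, hedged sketch of the unsupervised half of the session (synthetic blobs, not the session’s own data), here is k-means clustering with scikit-learn:

```python
# k-means clustering on three synthetic Gaussian blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Assign each point to one of three clusters by iteratively moving centers.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", sorted((km.labels_ == k).sum() for k in range(3)))
print("cluster centers:\n", km.cluster_centers_.round(2))
```

Each point receives a cluster label without any supervision; the learned centers approximate the blob means.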

**Preparation for the workshop:** 1. Review session 4’s overview. 2. Have a working Python 3 distribution with scikit-learn, matplotlib, numpy, pandas, and Jupyter notebook installed.

## Cross-Validation

#### Session 6: Cross-validation with Sarah Sant’Ana

Today’s session will introduce the concept of cross validation. Using instructional videos from the DataCamp Machine Learning Toolbox, we will walk through basic examples of cross validation in R using the caret package, with two publicly available data sets in R for the example code.

Our goals for this session are:

**Preparation for the workshop:**

– Be ready to use R and RStudio

– Read Yarkoni & Westfall (2017) through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!

##### Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100-1122.

**Materials:** Data Science Workshop Cross Validation in R

## Neural Networks

#### Session 7: Neural Networks with Ray Doudlah

Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, a computational model inspired by the way the human brain analyzes data.

With recent advancements in computing power and the explosion of big data, we can now implement large models that perform end-to-end learning (deep learning). This means that we can create a model, feed it tons and tons of data, and the model will learn features from the data that are important for accomplishing the task.

**Session outline:**

• Introduce the simplest neural network, the perceptron

• Discuss the general architecture for neural networks

• Implement a neural network to solve a handwriting recognition task

• Introduce deep learning (convolutional neural networks)

• Implement a deep neural network to solve a handwriting recognition task
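The perceptron in the first bullet fits in a few lines of NumPy; this toy sketch learns logical OR (a made-up stand-in, not the session’s handwriting task) with the classic update rule.

```python
# A single perceptron learning logical OR.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])      # OR is linearly separable

w = np.zeros(2)
b = 0.0
for _ in range(10):             # a few passes over the data suffice
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)
        # Nudge the weights toward each misclassified point.
        w += (yi - pred) * xi
        b += yi - pred

preds = (X @ w + b > 0).astype(int)
print("predictions:", preds)    # -> [0 1 1 1]
```

Because OR is linearly separable, the update rule converges to a weight vector that classifies all four inputs correctly, which is exactly the guarantee the perceptron loses on non-separable problems.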

**Preparation for the workshop:**

- Watch the following videos:
- Pull session 7 materials from GitHub

## Bayesian Inference

#### Session 8: **Bayesian Inference: estimating unobservable variables with Lowell Thompson**

This session will focus on introducing the utility of a common statistical method known as Bayesian inference. We’ll focus first on Bayes’ theorem and learn how it relates to our understanding of perception as an inverse problem. Since the majority of research in perception relies on various psychophysical methodologies to assess behavior, we’ll also walk through how you might generate your own experiments in Python using a package called PsychoPy. After obtaining some data, we’ll look at a specific example that illustrates the utility of Bayesian inference in modeling our own behavioral data. Lastly, we’ll place Bayesian inference in the broader context of data science.

**Session Outline:**

- Introduce Bayes’ theorem
- Understand the utility of Bayesian inference in a variety of contexts
- Learn the basics of PsychoPy to create basic experiments
- Use your own data from an orientation discrimination task to illustrate how Bayesian inference can be used.
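As a toy numerical sketch of Bayes’ theorem in this setting (every number below is invented for illustration), suppose a grating was tilted either −5° or +5° and we observe one noisy measurement:

```python
# Posterior over two tilt hypotheses given one noisy measurement.
import numpy as np
from scipy.stats import norm

tilts = np.array([-5.0, 5.0])    # hypotheses: left or right tilt, in degrees
prior = np.array([0.5, 0.5])     # equal prior belief

measurement = 2.0                # observed (noisy) tilt
sigma = 4.0                      # assumed sensory noise

likelihood = norm.pdf(measurement, loc=tilts, scale=sigma)
posterior = prior * likelihood
posterior /= posterior.sum()     # Bayes' theorem, normalized

print("P(left | m), P(right | m):", posterior.round(3))
```

Inverting from the noisy measurement back to the hidden tilt is the “inverse problem” framing of perception: the posterior favors the rightward tilt but retains uncertainty because the measurement is ambiguous.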

**Preparation:** Please try to install PsychoPy on your computer prior to the session, and run one of their tutorials to make sure it works: Psychopy