The Psychology Research Experience Program (PREP) provides mentoring and experience to undergraduates who have an interest in a scientific psychology career. LUCID partnered with PREP to create a hands-on data science workshop series. LUCID graduate students will facilitate the data science workshops.
This virtual data science workshop will be held on Wednesdays from 4-5 p.m. starting June 16th, 2021. Look for the meeting link in your email.
Overview:
In the workshops, students will be introduced to data science environments, concepts, and applications through the use of Jupyter notebooks. LUCID facilitators will introduce a series of data science concepts via online materials and hands-on virtual sessions. Students will work through examples and demos in the notebook environment, with guidance from LUCID graduate students.
Goals:
For PREP students to gain a sense of 1) how to work with an R or Python integrated development environment, 2) the kinds of things one can do with a range of data science tools, and 3) how to continue learning about and working with these tools in the future. Note that the goal will not be specifically to teach programming in R, Python, or any other language, but rather to show how to work interactively with and adapt notebooks that carry out common data science tasks, and to give a general sense of what the methods are used for and how they might be applied to one’s own data.
Materials and Session Outlines:
This will be updated with materials and facilitator outlines as they become available.
Schedule:
Date  Facilitator  Topic 
6/16  Tim Rogers  Intro to Data Science 
6/23  Laura Stegner  Natural Language Processing 
6/30  Kendra Wyant  Regularization 
7/7  Melissa Schoenlein  Mixed Linear Models 
7/14  Vince Frigo  Principal Component Analysis 
7/21  Scott Sievert  Optimization 
7/29  Lowell Thompson  Convolutional Neural Networks 
Natural Language Processing (NLP)
Session 1: Introduction to Natural Language Processing
Prepared by Laura Stegner, stegner@wisc.edu
What is Natural Language Processing?
Natural Language Processing (NLP) can be broadly thought of as the computational tools used to help computers understand and manipulate spoken or written natural language to do useful things. This goal can be achieved with the help of various NLP tasks, such as:
 Part-of-speech tagging
 Speech recognition
 Word sense disambiguation
 Sentiment analysis
 Natural language generation
 Named entity recognition
 Coreference resolution
Each of the above tasks is briefly described in this article by IBM.
Practically, NLP is present in our everyday lives. Some common examples include autocorrect, autocomplete, related search terms in a search engine, email filtering, smart agents (e.g. Siri or Alexa), and machine translation (e.g. Google Translate). It is also useful in business applications, such as analyzing reviews or creating automated calling systems and chatbot assistants.
When would I want to use NLP?
While NLP is being readily implemented in everyday products, it is also greatly useful in data science. NLP can be used to convert messy, unstructured natural language responses (such as interview data or open responses to survey questions) into more structured, processable data forms. Using NLP techniques to analyze data can serve to speed up processing time and also eliminate inconsistencies from manual analysis.
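As a rough illustration of turning free-text responses into structured data, the sketch below counts word frequencies across a few survey-style answers using only the Python standard library. The responses, the stop-word list, and the helper name tokenize are all invented for this example; a real analysis would more likely rely on a library such as nltk.

```python
import re
from collections import Counter

# Hypothetical open-ended survey responses (made up for this sketch).
responses = [
    "I loved the workshop, the demos were great!",
    "The demos were too fast, but the workshop was useful.",
    "Great workshop. More demos please.",
]

# A tiny, hand-picked stop-word list; real projects use much larger ones.
STOP_WORDS = {"i", "the", "were", "was", "but", "too", "more"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# One Counter per response: a simple "bag of words" representation.
bags = [Counter(tokenize(r)) for r in responses]

# Totals across all responses give a structured summary of the raw text.
totals = Counter()
for bag in bags:
    totals.update(bag)

print(totals.most_common(3))
```

Each response becomes a small table of word counts, which can then be compared, summed, or fed into a statistical model in a way the raw sentences cannot.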
Preparation
Prior to our meeting, please review the following materials:
 (optional but interesting) Article that walks through the history of NLP: https://machinelearningmastery.com/naturallanguageprocessing/
 High-level introduction to NLP (12-minute video): https://www.youtube.com/watch?v=fOvTtapxa9c
 Slightly different take on NLP (4-minute video): https://www.youtube.com/watch?v=d4gGtcobq8M
 Lecture that introduces sentiment analysis (7-minute video): https://www.youtube.com/watch?v=S4z0UG07b0
 Article about bias in NLP: https://towardsdatascience.com/biasinnaturallanguageprocessingnlpadangerousbutfixableproblem7d01a12cf0f7
 Short article about general ethical considerations when using NLP in a clinical setting: https://arxiv.org/pdf/1703.10090.pdf
Also think about the following. We will have a discussion related to some of these topics 🙂
 Times you have encountered NLP in either your research or your daily life.
 Situations where you don’t use NLP but where it would come in handy, and how.
 Why we should care about the ethical considerations of NLP in data science.
Additionally, install the following packages in Python 3:
 nltk:
pip3 install nltk==3.3
or
python3 -m pip install nltk==3.3
Additional Reading / Reference
Chowdhury, G.G. (2003), Natural language processing. Ann. Rev. Info. Sci. Tech., 37: 51-89. https://doiorg.ezproxy.library.wisc.edu/10.1002/aris.1440370103
Hovy, D., & Spruit, S. L. (2016, August). The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 591-598). https://www.aclweb.org/anthology/P162096.pdf
Leidner, J. L., & Plachouras, V. (2017, April). Ethical by design: Ethics best practices for natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (pp. 30-40). https://www.aclweb.org/anthology/W171604.pdf
Tutorial and demo materials are on Laura’s github site:
https://github.com/lstegner/nlptutorialPREP2021/tree/main/tutorialmaterials
Regularization
Introduction to Regularization with Kendra Wyant
What is Regularization?
Regularization is a type of regression that imposes a penalty on coefficients in complex models. This penalty reduces overfitting by introducing some bias into the model. As we see with the bias-variance tradeoff, introducing some bias can reduce variance in model predictions on new data, making the model more generalizable.
Types of regularization
* Ridge regression: variables with a minor contribution have their coefficients shrunk close to zero. However, all the variables are incorporated in the model. This is useful when all variables need to be incorporated in the model according to domain knowledge.
* Lasso regression: the coefficients of some less contributive variables are forced to be exactly zero. Only the most significant variables are kept in the final model.
* Elastic net regression: the combination of ridge and lasso regression. It shrinks some coefficients toward zero (like ridge regression) and sets some coefficients to exactly zero (like lasso regression).
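The difference between the three penalties can be sketched with a deliberately simplified case: one predictor, no intercept, and made-up data. In that setting each penalized estimate has a short closed form. This is a Python illustration only (the workshop itself uses R), and the function names and the penalty value of 2.0 are assumptions chosen for the example.

```python
def soft_threshold(z, t):
    """Move z toward zero by t, returning exactly 0.0 if it would cross zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sums(x, y):
    """Return sum(x*y) and sum(x*x), the two quantities every estimate needs."""
    return sum(a * b for a, b in zip(x, y)), sum(a * a for a in x)


def ols(x, y):
    sxy, sxx = sums(x, y)
    return sxy / sxx


def ridge(x, y, lam):
    # Minimizes sum((y - b*x)^2) + lam * b^2
    sxy, sxx = sums(x, y)
    return sxy / (sxx + lam)


def lasso(x, y, lam):
    # Minimizes sum((y - b*x)^2) + lam * |b|
    sxy, sxx = sums(x, y)
    return soft_threshold(sxy, lam / 2) / sxx


def elastic_net(x, y, lam_l1, lam_l2):
    # Minimizes sum((y - b*x)^2) + lam_l1 * |b| + lam_l2 * b^2
    sxy, sxx = sums(x, y)
    return soft_threshold(sxy, lam_l1 / 2) / (sxx + lam_l2)


x = [1.0, 2.0, 3.0, 4.0]
y_strong = [2.1, 3.9, 6.2, 7.8]   # clearly related to x (made-up values)
y_weak = [0.3, -0.2, 0.1, -0.1]   # essentially noise (made-up values)

for name, y in [("strong", y_strong), ("weak", y_weak)]:
    print(name, ols(x, y), ridge(x, y, 2.0), lasso(x, y, 2.0),
          elastic_net(x, y, 2.0, 2.0))
```

For the weak predictor, the ridge estimate is small but nonzero, while the lasso and elastic net estimates are exactly zero. For the strong predictor, all three shrink the ordinary least-squares estimate without removing it.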
List of Related Topics/Ideas
We won’t be able to cover all of these topics due to time, but I will provide resources and code for anyone who is interested in exploring these further or using them in their own research. I am also happy to chat more outside the workshop!
 Prediction vs Explanation in Psychology
 Overfitting
 Bias/variance tradeoff
 Test and training sets
 Cross-validation and resampling
Preparation
Watch:
StatQuest Youtube Series
 Machine learning fundamentals – bias and variance (6:35) https://www.youtube.com/watch?v=EuBBz3bIaA
 Ridge regression clearly explained (20:26) – https://www.youtube.com/watch?v=Q81RR3yKn30
 Lasso regression clearly explained (8:18) – https://www.youtube.com/watch?v=NGf0voTMlcs&t
 Elastic net regression clearly explained (5:18) – https://www.youtube.com/watch?v=1dKRdX9bfIo
Optional: Machine Learning Fundamentals: Cross Validation (6:04) https://www.youtube.com/watch?v=fSytzGwwBVw
Read:
 Skim the first 10 pages of Yarkoni and Westfall (2017) http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf
 Read this blog post on overfitting https://www.ibm.com/cloud/learn/overfitting
Install Software:
 We will be using R and RStudio
 Install the following packages in RStudio:
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("kableExtra")
install.packages("skimr")
install.packages("naniar")
install.packages("doParallel")
install.packages("mlbench")
install.packages("vip")
install.packages("Matrix")
install.packages("glmnet")
Additional Resources
Coding
 R for Data Science – https://r4ds.had.co.nz/
 Tidyverse style guide – https://style.tidyverse.org/
 Julia Silge blog – https://juliasilge.com/blog/
 Tidy modeling with R – https://www.tmwr.org/
Machine learning resources
 Introduction to statistical learning – https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/6009dd9fa7bc363aa822d2c7/1611259312432/ISLR+Seventh+Printing.pdf
 Applied predictive modeling – https://vuquangnguyen2016.files.wordpress.com/2018/03/appliedpredictivemodelingmaxkuhnkjelljohnson_1518.pdf
I am looking forward to meeting all of you on Wednesday. Please don’t hesitate to reach out about anything (kpaquette2@wisc.edu). I am happy to talk about data science, PREP, Madison, grad school, and more!
Demo script and other resources can be found on Kendra’s github site:
Mixed Linear Models
Introduction to Mixed Linear Models with Melissa Schoenlein
Mixed linear models are a type of analysis used to evaluate data with nonindependence that cannot otherwise be analyzed with regular linear regression.
What is nonindependence/nonindependent data?
Nonindependence occurs when two or more data points are connected (correlated) in some way. For example, suppose you run an experiment collecting ratings on interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and then again at the end of the semester. Each of these participants has three data points. These data points are nonindependent since they come from the same person and thus are related in ways beyond the experimental procedure (i.e. data points from one participant are likely to be more similar to each other than data points from two different participants).
Nonindependence can exist beyond repeated measures at the participant level to any items occurring within “units”, including students in classrooms, family members, etc.
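To make the repeated-measures example concrete, here is a small Python sketch with invented ratings for three participants. It compares how far apart two ratings are when they come from the same person versus from different people. The numbers and participant labels are made up, and this is only an informal check, not a mixed model.

```python
from itertools import combinations

# Made-up math-interest ratings (1-10 scale): three participants,
# each rated at the start, middle, and end of the semester.
ratings = {
    "P1": [8, 9, 8],
    "P2": [3, 4, 2],
    "P3": [6, 5, 6],
}

# Flatten to (participant, rating) pairs so every pair of points can be compared.
points = [(pid, r) for pid, values in ratings.items() for r in values]

within, between = [], []
for (pid_a, a), (pid_b, b) in combinations(points, 2):
    gap = abs(a - b)
    if pid_a == pid_b:
        within.append(gap)    # both ratings come from the same participant
    else:
        between.append(gap)   # ratings come from two different participants

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(f"mean gap within participants:  {mean_within:.2f}")
print(f"mean gap between participants: {mean_between:.2f}")
```

Ratings from the same participant sit much closer together than ratings from different participants, which is exactly the clustering that regular linear regression ignores.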
Why/when should I use mixed linear models?
Using regular linear regression when data is nonindependent can lead to inflated Type 1 error rates, less statistical power, and potentially inaccurate effects. A mixed linear model should be used anytime there is a nonindependent relationship in the data.
List of ideas/concepts/tools that are associated with this topic
Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.
Nonindependence
Fixed versus random effects
lme4 package in R
Preparation:
In preparation for our virtual workshop Wednesday 7/7 at 4pm CST, please watch and read the following materials.
 Watch videos 1-3, 11, and 16 from this multipart video series providing a general overview of mixed models, when to use them, and how to interpret them (totals ~ 12 minutes). Video 11 focuses on repeated measures models, which will be the focus of our workshop.
 Skim through this online tutorial that provides a walkthrough of code and output of basic linear mixed effects models in R and why we use them.
 Skim through this very short cheat sheet of using the lme4 package in R to analyze mixed models.
 Install the following packages in R: lme4, ggplot2
Optional additional resources if you’re interested in learning more:
Videos:
A high-level video overview of mixed models (mostly framed in terms of hierarchical models). The first half of the video describes when/why someone would use these models. The second half starts to touch on the equations/math for these models.
Tutorials:
A GitHub repo with a 3-part workshop aimed at providing tutorials and exercises to learn how to do mixed models in R. The first part is a general intro to R. The second part is about statistical modeling (generally) in R. Then part 3 is mixed models in R.
A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.
Papers:
This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.
Jake Westfall, a former quantitative psychologist who now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.
Workshop will be led by Melissa Schoenlein. I can be reached at schoenlein@wisc.edu if there are any issues accessing these materials or if there are any questions (about the workshop, the PREP program, the department, or anything!). Looking forward to meeting this year’s PREPsters!
Principal Component Analysis
Introduction to Principal Component Analysis with Vince Frigo
 Read through the PCA introduction
 Try a simple PCA
 Use PCA results to plot
Optimization
Optimization with Scott Sievert
Watch Scott’s Intro Video: box.stsievert.com/prep
The following can also be found on Scott’s github: github.com/stsievert/PREP21
Hello PREP students! My learning objectives are to answer these questions:
 Why should I care about optimization?
 What are the basics of optimization? How do I get a better solution?
 Where does optimization fail?
“Optimization” is producing a model that accurately represents data, aka “fitting” a model to data. Importantly, the choice of “model” and “data” are perhaps more important than the specific method of fitting the model to the data. In short, optimization is what happens with this code:
from sklearn.linear_model import LinearRegression
estimator = LinearRegression()
# X and y are stand-ins for other data; they could easily come from a CSV
X = [[1, 2], [3, 4]]
y = [3, 5]
estimator.fit(X, y)
In this lesson, I’ll try to open up the black box that happens when you call fit. I’ve selected about an hour’s worth of video for you to watch, and will try to highlight some relevant issues in person.
Note: optimization is heavy in mathematics. I will try to illustrate optimization without relying on mathematics.
Background
What’s optimization?
Optimization is a process to “fit” a “model” to “data.”
 Data, typically some features and a label for each example.
 A model which will try to predict the label from a feature vector.
 A loss function that characterizes how poorly the model is performing for a specific example.
“Fitting” means “can the model accurately predict an unseen example?” Here are some good background videos on the components above:
 What’s optimization? youtube.com/watch?v=x6f5JOPhci0 (10:08) provides a general overview of optimization methods (and tradeoffs of those methods) and some common issues in optimization in a real-world example.
 How are machine learning (ML) and optimization related? youtube.com/watch?v=NzwMV2b7jbQ (10:31) introduces ML models and how to find them. In addition, what is the primary goal given noisy/nonstandard examples?
How are models found?
The videos above provide a general overview of machine learning/optimization and a general idea of what happens inside fit. Now, let’s get into some specifics on how to find the best model for the models mentioned in “Mixed Linear Models”:
 How is linear regression performed? youtube.com/watch?v=PaFPbb66DxQ (9:21).
 How is the minimum “loss” or “error” found in machine learning? youtube.com/watch?v=IHZwWFHWaw (only the first 11:18 is relevant from this 21:00 video).
 Which loss function should I use? youtube.com/watch?v=fr7dfyfB7mI (6:14) steps through different use cases where different loss functions would apply. This is the most important part of ML.
This is enough background to understand my examples. In the examples, I’ll highlight some issues with optimization, including data size, noise, and loss functions.
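For a feel of what a call to fit might be doing, here is a minimal gradient descent sketch in plain Python. It fits a straight line to a handful of made-up points by repeatedly nudging the slope and intercept downhill on a mean-squared-error loss. The data, learning rate, and number of steps are arbitrary choices for this illustration, and real libraries typically use more sophisticated solvers.

```python
# Toy data written by hand to lie roughly on y = 2x + 1 (values are made up).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def mse(slope, intercept):
    """Mean squared error of the line on the toy data (the loss)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(slope, intercept):
    """Partial derivatives of the loss with respect to slope and intercept."""
    n = len(xs)
    d_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / n
    d_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / n
    return d_slope, d_intercept


slope, intercept = 0.0, 0.0  # start from a deliberately poor guess
learning_rate = 0.05         # step size; an arbitrary choice for this sketch
initial_loss = mse(slope, intercept)

for _ in range(2000):
    d_slope, d_intercept = gradient(slope, intercept)
    slope -= learning_rate * d_slope          # step against the gradient
    intercept -= learning_rate * d_intercept

final_loss = mse(slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, loss={final_loss:.4f}")
```

Try changing learning_rate: a value that is too large makes the loss blow up instead of shrink, which is one of the ways optimization can fail.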
Demo
The videos above are all the material you need for the demo. To follow along with my demos, visit github.com/stsievert/PREP21/blob/master/README.md
Want to learn more?
This material is not required for the example.
Here are some other useful videos:
 How do I score a model? youtube.com/watch?v=rY5pdNW7jKM (4:40) steps through the data you should use, a critical choice. (I can talk all day about this).
 How does classification work? And how can it be modified to support more complex data? youtube.com/watch?v=Z4aojJpdg&t=5m40s (12:23)
 Which model class should I use? How to choose a model class: youtube.com/watch?v=7jjzMZOdPZw (18:37)
 What’s a neural network? youtube.com/watch?v=aircAruvnKk (19:13)
Also, I would skim Chapter 7 of “Shape” by Jordan Ellenberg (23 pages). It’s light reading, and stitches together a good story of optimization. The author, Jordan Ellenberg, is a mathematics professor at UW–Madison and experienced with optimization.
This chapter can be found in the #prep channel in the LUCID Slack workspace: wisclucid.slack.com
In addition, I’ve written a blog series on optimization that tries to introduce the math behind optimization:
 “Least squares and regularization,” which steps through the basics of linear regression stsievert.com/blog/2015/11/19/inversepart1/
 “Finding sparse solutions to linear systems,” which examines a particular type of regularization (and has some fancy interactive widgets to understand what the minimization is doing) stsievert.com/blog/2015/12/09/inversepart2/
 “Gradient descent and physical intuition for heavy-ball acceleration with visualization”, which looks at a method to modify optimization methods. stsievert.com/blog/2016/01/30/inverse3/
Convolutional Neural Networks
CNNs with Lowell Thompson
In this week’s session we will be learning about neural networks, focusing primarily on convolutional neural networks (CNNs). CNNs have become a useful tool for the development of self-driving cars, object and face recognition software, medical imaging analysis (e.g., MRI), and many other areas. These models can be simple to build using tools like TensorFlow and PyTorch, the latter of which we’ll use for our demo. Their inner workings, however, combine nearly all of the tools introduced throughout this workshop, including linear regression, regularization, optimization, and dimensionality reduction. I hope to provide a brief introduction to CNNs, give you some hands-on experience with a prebuilt model, and then provide some time for discussion.
Session outline:
 Introduction to Neural Networks
 Introduction to CNNs
 Demo session with a prebuilt CNN
 Discussion
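If you are curious about the “convolutional” part before the session, the sketch below applies a small filter to a tiny made-up image in plain Python. The function name convolve2d, the image, and the filter values are all invented for illustration; in the demo, PyTorch handles this step (and learns the filter values) for you.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            # Multiply the kernel with the patch under it and add everything up.
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A made-up 4x4 "image": dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hand-written 2x2 filter that responds to left-to-right brightness increases.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The output is large only where the filter sits on the dark-to-bright boundary, so the filter acts as a simple edge detector. A CNN stacks many such filters and learns their values from data through optimization.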
Preparation for the workshop:
 Watch the following videos:
 Installation Requirements
 Python https://www.python.org
 PyTorch https://pytorch.org/
 NumPy
 Matplotlib
 Read these articles
 Pull code from GitHub
If you have trouble viewing any of the materials, please let me know (lwthompson@wisc.edu).
Resources and Sessions from 2020:
Support Vector Machines
Session 1: Support Vector Machines (SVM) with Kushin Mukherjee
Support Vector Machines (SVMs) deal with a fundamentally simple problem – how do we divide up datapoints using some form of meaningful decision boundary in a supervised learning setting? This approach gets its name from support vectors, a subset of the labeled data points whose dot products help in determining the decision boundary.
In contrast to approaches like simple neural networks or least-squares classifiers, SVMs have 2 overall advantages that are important to consider together:
 They do not get stuck in local minima. If the data are linearly separable, the algorithm will always find the same ‘best’ decision boundary
 If the data aren’t linearly separable, the SVM approach supports a transformation of the dot products into a space where the data are linearly separable. This is what’s known as the ‘kernel trick’ in SVMs.
(Note: While I do distinguish the SVM approach from simple neural networks, it has been shown that there are specific classes of neural networks that are equivalent to kernel methods such as those in SVM. Here’s a brief summary – What are the Mathematical Relationship between Kernel Methods and Neural Networks)
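The kernel trick can be checked numerically on a small example. The sketch below uses a degree-2 polynomial kernel on two made-up 2-D points and shows that the kernel value, computed from the original vectors, matches the dot product in an explicitly transformed 3-D space. The function names and the points are illustrative assumptions, and this is not how an SVM library is implemented internally.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed from the original 2-D vectors."""
    return dot(x, z) ** 2


# Two arbitrary example points (made up for this sketch).
x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # dot product in the 3-D transformed space
via_kernel = poly_kernel(x, z)   # same quantity without ever building phi
print(explicit, via_kernel)
```

Because the SVM only ever needs dot products between data points, swapping in a kernel function lets it work in the transformed space without computing the transformation itself.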
List of ideas/concepts/tools that are associated with this topic
 Classification
 Supervised learning
 Linear separability
 Kernel methods
Preparation for meeting:
First Watch: Patrick Winston’s lecture on SVMs is one of the easiest to follow and assumes a very minimal background in linear algebra and multivariable calculus: Youtube
Try this out second! You will need Jupyter and the necessary libraries installed. A Python-based implementation of SVM using scikit-learn: Stackabuse
Additional Optional Resources:
Videos:
One might like Andrew Ng’s lecture on the same from 2018, which is a bit more recent, but SVMs haven’t changed much over the past decade: Youtube (start from 46:20)
Online tutorials:
To get a stronger grasp on the mathematics behind SVMs and do some ‘hands-on’ work with them I recommend this site: SVM Tutorial
Here’s another Jupyter-notebook-based Python implementation of SVMs using scikit-learn: Learnopencv
Applied Papers:
The following is useful for seeing how these tools are used in cognitive science more broadly.
Here are 2 papers that employ SVMs in NLP and cognitive neuroscience settings
Shallow semantic parsing of sentences using SVMs: aclweb
Effective functional mapping of fMRI data using SVMs: ncbi
Theory Papers:
The original SVM paper by Vladimir Vapnik: image.diku
Jupyter Notebooks Tutorial
Jupyter Notebooks Online Tutorial with Pablo Caceres
The following is a great resource to watch/read at your own pace, and feel free to contact Pablo with any questions.
Unix Shell Tutorial
R Markdown
Introduction to R Markdown with Gaylen Fronk
R Markdown provides an authoring framework for data science in R. With a single R Markdown file, you can not only write, save, and execute your code but also communicate your process and results with an audience using high-quality, reproducible output formats.
More detail about R Markdown
R Markdown builds off tools already available in R and RStudio to provide an integrated environment for processing, coding, and communicating. An R Markdown file can include text, chunks of code, images, links, figures, and tables. While you’re working in your RStudio environment, your file operates similarly to a normal R script (a .R file) – you can write, edit, and evaluate code to work with your data. At any point, you can “knit” your file. Knitting runs, evaluates, and compiles your R Markdown file into your desired output (e.g., HTML, PDF) to create a single document that includes all the components of your written file plus the results. This knit file is ready for high-quality scientific communication with any audience. If you’ve ever seen nice examples of R code and output online, it was probably made using R Markdown.
Why should I use R Markdown?
R Markdown is particularly helpful if…
 You already work in R or RStudio and would like some additional tools at your disposal
 You value reproducible output
 You would like to be able to share your work with people who are less familiar with R (or coding more generally)
R Markdown combines the data wrangling and analytic tools of R with high-class scientific communication. It can become your one-stop shop for sharing your data science.
Prepare for the LUCID/PREP Data Science Workshop on R Markdown:
In preparation for our video meeting next week (Wednesday 7/1 at 4pm CST), please watch, read, or review the following materials.
 Begin with this 1-minute video of what’s possible with R Markdown.
 Read Chapter 1 (Installation) from R Markdown: the Definitive Guide (Note: you should have R & RStudio installed prior to our workshop. Confirm in advance that you can open these applications.)
 Read Chapter 2 (Basics) from R Markdown: the Definitive Guide
 Read this section of Chapter 3 (Outputs: HTML) from R Markdown: The Definitive Guide
 Review this cheat sheet and have it handy for our meeting
Optional additional resources if you’re interested in learning more:
 This paper from the Statistics area of arXiv.org discusses how R Markdown can improve data science communication workflow. It’s perfect for people interested in understanding why R Markdown may be beneficial and seeing examples of its use cases.
 This online book contains lessons on R Markdown basics, specific output formats, inline and chunk code, tables, interactive websites, presentations, using multiple coding languages, and more. It’s perfect for someone looking for a comprehensive (yet still quite succinct) tutorial on using R Markdown.
 The Communication section from the R for Data Science online book includes several chapters on R Markdown (the tidyverse’s preferred method for statistical and scientific communication).
 This online code from GitHub Gist provides an example/walkthrough of using R Markdown.
A note from Gaylen:
If you have questions about these materials or other questions you’d like answered during our workshop, you can submit them via this form. Please try to do this by Tuesday 6/30 at 5pm CST so that I can aggregate questions in advance.
Workshop will be led by Gaylen Fronk. You can email me at gfronk@wisc.edu if you have problems accessing these materials or installing R/RStudio. Looking forward to meeting you all!
Regression using Jupyter Notebooks
Optimization and model regularization with Owen Levin
Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems. We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.
Overview:
 linear regression as an optimization problem
 introduce loss functions
 curve fitting as optimization
 Is a perfect fit actually perfect? (wacky zero loss examples)
 model regularization
 small weights
 sparsity
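As a taste of the “Is a perfect fit actually perfect?” question, the sketch below fits two models to five made-up points that roughly follow a straight line: a degree-4 polynomial that passes through every point (zero training loss) and an ordinary least-squares line. The data and function names are invented for this illustration and are not taken from the session notebook.

```python
# Made-up noisy observations of a roughly linear trend (y is close to x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.2, 1.8, 3.3, 3.9]


def interpolate(x):
    """Degree-4 Lagrange polynomial that passes through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Ordinary least-squares line, using the closed-form slope and intercept.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def line(x):
    """Prediction from the fitted straight line."""
    return slope * x + intercept


def training_loss(model):
    """Sum of squared errors on the training points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


print("training loss:", training_loss(interpolate), training_loss(line))
print("prediction at x = 5:", interpolate(5.0), line(5.0))
```

The polynomial wins on training loss but predicts a negative value at x = 5, where the trend suggests something near 5. The straight line has a worse training loss and a far more sensible prediction, which is the motivation for regularization.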
Preparation:
1. If you haven’t already downloaded Anaconda or another Python distribution, please do so.
2. View this video: Owen’s Regression Intro
3. Jupyter Notebook: Optimization & Regularization
Or check out the github repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop
Mixed Linear Models
Introduction to Mixed Linear Models with Melissa Schoenlein
Mixed linear models are a type of analysis used to evaluate data with nonindependence that cannot otherwise be analyzed with regular linear regression.
What is nonindependence/nonindependent data?
Nonindependence occurs when two or more data points are connected (correlated) in some way. For example, suppose you run an experiment collecting ratings on interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and then again at the end of the semester. Each of these participants has three data points. These data points are nonindependent since they come from the same person and thus are related in ways beyond the experimental procedure (i.e. data points from one participant are likely to be more similar to each other than data points from two different participants).
Nonindependence can exist beyond repeated measures at the participant level to any items occurring within “units”, including students in classrooms, family members, etc.
Why/when should I use mixed linear models?
Using regular linear regression when data is nonindependent can lead to inflated Type 1 error rates, less statistical power, and potentially inaccurate effects. A mixed linear model should be used anytime there is a nonindependent relationship in the data.
List of ideas/concepts/tools that are associated with this topic
Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.
Nonindependence
Fixed versus random effects
lme4 package in R
Preparation:
In preparation for our video meeting on Wednesday 7/15 at 4pm CST, please watch and read the following materials.
 Watch videos 1-3, 11, and 16 from this multipart video series providing a general overview of mixed models, when to use them, and how to interpret them (totals ~ 12 minutes). Video 11 focuses on repeated measures models, which will be the focus of our workshop.
 Skim through this online tutorial that provides a walkthrough of code and output of basic linear mixed effects models in R and why we use them.
 Skim through this very short cheat sheet of using the lme4 package in R to analyze mixed models.
 Install the following packages in R: lme4, ggplot2
Optional additional resources if you’re interested in learning more:
Videos:
A high-level video overview of mixed models (mostly framed in terms of hierarchical models). The first half of the video describes when/why someone would use these models. The second half starts to touch on the equations/math for these models.
Tutorials:
A GitHub repo with a 3-part workshop aimed at providing tutorials and exercises to learn how to do mixed models in R. The first part is a general intro to R. The second part is about statistical modeling (generally) in R. Then part 3 is mixed models in R.
A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.
Papers:
This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.
Jake Westfall, a former quantitative psychologist who now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.
Workshop will be led by Melissa Schoenlein. I can be reached at schoenlein@wisc.edu if there are any issues accessing these materials or if there are any questions (about the workshop, the PREP program, the department, or anything!). Looking forward to meeting this year’s PREPsters!
Data Visualization with Python in Jupyter Notebooks
Data Visualization with Python in Jupyter Notebooks with Pablo Caceres
Cross Validation
Cross Validation with Sarah Sant’Ana
Cross validation is a common resampling technique used in machine learning studies. Broadly, cross validation involves splitting data into multiple training and testing subsets to increase generalizability of the model building and evaluation processes. There are multiple types of cross validation (e.g. k-fold, bootstrapped), but all serve two primary purposes:
 To select the best model configurations (e.g. what type of statistical model will perform best, which sets of features will perform best, covariate selection, hyperparameter tuning, outlier identification approaches, predictor transformations, and more).
 To evaluate the expected performance of our models in new data (i.e. on individuals who were never used in model building/selection)
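The splitting step can be sketched in a few lines of plain Python. The function below is a hypothetical helper written for this page, not a function from caret, tidymodels, or scikit-learn. It shuffles the row indices once and then lets each fold take a turn as the held-out test set.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={sorted(train)} test={sorted(test)}")
```

In practice you would fit a model on each train set, score it on the matching test set, and average the scores. Every observation is held out exactly once, so no model is ever evaluated on data it was fit to.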
Why should I use cross validation?
You should use cross validation if…
 You are fitting a statistical model with hyperparameters that need tuning (e.g. elastic net logistic regression, random forests, SVM)
 You are considering multiple combinations of model configurations (e.g. features, statistical algorithms, data transformations)
 You want to consider a large number of predictive features or you do not want to rely on theory to guide identification of predictive features
 You want to build predictive models that will generalize well to new data (i.e. you want your model to be applied in some way)
List of ideas, concepts, or tools that are associated with this topic
 R/RStudio (especially the caret, tidymodels, and parsnip packages)
 Python
 Common types of cross validation (CV): bootstrapped CV, k-fold CV, nested CV
 Basic knowledge of linear and logistic regression
 Bias/variance trade offs in model fitting and evaluation
 Generalizability of predictive models (why it’s important, how to prioritize it, and how to assess it)
In preparation for our meeting next Tuesday, please review the following materials:
 For framing, please read the beginning of Yarkoni & Westfall (2017) http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!
 Watch this solid 6-minute explanation of cross validation by StatQuest https://www.youtube.com/watch?v=fSytzGwwBVw
 Skim this “big picture” blog post that provides more clarity surrounding distinctions between model evaluation and selection: A “short” introduction to model selection
During the meeting on Tuesday
 Plan on a discussion about prediction vs explanation in psychological research. I want to help you think of how you might apply cross validation in your work if you are interested 😊
 I will be walking us through the attached Cross Validation Markdown document to provide you with some code for implementing cross validation. (Open the link and then download the file; Google will default to opening it as a Google Doc, which is not functional.) No need to read this beforehand, but you can have it open during the session if you’d like to follow along.
 Feel free to send me any questions beforehand or ask during the session! Happy to talk research, data science, or grad school as would feel beneficial to you all. My email is skittleson@wisc.edu
Additional Materials (not required, just for your reference)
Books
 Here are selected readings on cross validation from two *free* online textbooks: James et al. (2013) Chapter 5: Resampling Methods (pp 175 – 186) and Kuhn and Johnson (2018) Chapter 4: Resampling Techniques (pp 67 – 78). These books are amazing for learning about any sort of applied statistical learning – highly recommend!
Online tutorials (blogs and code examples):
 This is an R Markdown file written by the creator of the caret package in R (one of the most used machine learning packages in R to date): Model training and tuning. It explains how to tune the various types of hyperparameters using CV within caret’s train function. Even if you don’t plan to use R, it is helpful to see what types of parameters are tuned for different models, and it provides examples of creating and evaluating search grids, alternate performance metrics, and more.
 This is a nice (but lengthy) R Markdown example of approaching a classic machine learning problem (product price estimation) that showcases hyperparameter tuning of a couple of different algorithms (and their comparison): Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial. This is geared towards a more advanced beginner. It still walks you through everything, but incorporates more robust data cleaning and exploration before model fitting.
Videos:
 This video is a good walkthrough of using k-fold cross-validation in Python to select optimal tuning parameters, choose between models, and select features: Selecting the best model in scikit-learn using cross-validation
 A short 4-minute tutorial about how to tune various types of statistical learning models within cross validation using the caret package in R. It doesn’t discuss much of the theory and is more appropriate for application-focused users who are just trying to figure out how to implement parameter tuning within CV: R Tutorial – Hyperparameter tuning in caret
Papers:
 This paper describes the impact of using different CV types for parameter selection and model evaluation: Bias in error estimation when using cross-validation for model selection. This requires an intermediate-level understanding of using CV for parameter selection. Many people using machine learning in applied contexts are using improper CV methods that bias their model performance estimates. We should be using nested CV (or bootstrap CV with a separate validation set) if we are planning to select model parameters and generate trustworthy performance metrics.
 Really cool preprint that describes sources of bias in ML resampling methods due to incorrect application in psychological research https://psyarxiv.com/2yber/. A more intermediate level read because it requires some understanding of multiple types of CV methods.
Neural Networks
Neural Networks with Ray Doudlah
Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, a computational model inspired by the brain.
 Introduce neural networks and their general architecture
 Introduce convolutional neural networks
 Implement a convolutional neural network to solve a handwriting recognition task
 Watch the following videos:
 Read these articles
 Pull code from GitHub
Overleaf
Overleaf by Glenn Palmer
LaTeX is a typesetting system that can be used to write academic papers and create professional-looking documents. Users type in plain text format, but mark up the text with tagging conventions, and the nicely formatted result is shown in an output file. Overleaf is an online platform that can be used to create and edit LaTeX documents. You can share and simultaneously edit documents with collaborators, similar to the way you collaborate on a Google Doc.
For a high-level overview of LaTeX, Overleaf, and the resources below, watch this video:
Videos
 This playlist of videos is a good starting place. They were made by a company called ShareLaTeX, which recently merged with Overleaf. These videos give a good idea of how to get started using LaTeX with an online editing system.
Online tutorials
 For more detail, and/or for a range of written tutorials, the Overleaf documentation page has a wide range of information to help get started, or to answer specific questions you might have as you get used to using LaTeX.
Cheat sheet
 For a quick reference as you’re writing, this cheat sheet includes a bunch of commands for various formatting options, with a focus on writing scientific papers.
Resources and Sessions from 2019:
Introduction to Data Science with R
Session 1: Introduction to data science with R with Tim Rogers
This session will introduce you to working with data in an “integrated development environment” or IDE using the freely available and widely used software package R. We will briefly discuss what is meant by the term “data science,” why data science is increasingly important in Psychology and Neuroscience, and how it differs from traditional statistical analysis. We will then get a sense for how IDEs work by building, from data generated in the workshop, an interactive graph showing the structure of your mental semantic network.
Preparation for the workshop: (TO DO before arriving on Tuesday!)
– Install R, R Studio, and Swirl on your laptop following the instructions here: swirlstats.com/students
– Start Swirl as instructed at the website and install the first course module by following the prompts
– Run yourself through the first course module
Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!
Overview:
We learned how to create semantic clusters from lists of animals. Tim created this Semantic Network Demo to view the interactive graph and get the code that was used to generate the semantic clusters. The demo walks through the process of building and visualizing graphs.
Using GitHub & Jupyter Notebooks
Session 2: Using GitHub and JuPyteR notebooks in several data science environments with Pablo Caceres
Preparation for the workshop:
Fitting & Evaluating Linear Models
Session 3: Fitting and evaluating linear models with John Binzak
This session will introduce you to working with linear regression models using R. We will briefly discuss why linear regression is useful for Psychology and Educational research, using the topic of numerical cognition as an example. We will play an educational game to generate our own data in the workshop, form predictions, and test those predictions by modeling gameplay performance. Through this exercise we will cover how to fit linear regression models, assess the fit of those models, plot linear relationships, and draw statistical inferences.
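To make "fitting" and "assessing fit" a bit more concrete, here is a small sketch of ordinary least squares with one predictor, written in plain Python rather than the R used in the session. The data values and the names fit_line and r_squared are made up for illustration and are not the workshop's gameplay data or code.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Proportion of variance in y accounted for by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: rounds of a number game played vs. estimation error
rounds = [1, 2, 3, 4, 5, 6]
error = [9.8, 8.1, 7.2, 5.9, 5.1, 3.8]

intercept, slope = fit_line(rounds, error)
fit = r_squared(rounds, error, intercept, slope)
print(intercept, slope, fit)
```

In R, the same fit would typically come from lm(error ~ rounds), which also reports the standard errors needed for statistical inference.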
Preparation for the workshop:
– Be ready to use R, RStudio, and Swirl on your laptop, following the instructions here: swirlstats.com/students
– Install the “Regression Models” swirl module using the following commands in R:
> library(swirl)
> swirl::install_course("Regression Models")
> swirl()
– Run yourself through lessons 1-6 (Introduction through MultiVar Examples) and continue based on your interest.
Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!
Optimization & Model Regularization
Session 4: Optimization and model regularization with Owen Levin
Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems. We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.
 linear regression as an optimization problem
 introduce loss functions
 curve fitting as optimization
 Is a perfect fit actually perfect? (wacky zero loss examples)
 model regularization
 small weights
 sparsity
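The outline above can be sketched in a few lines of plain Python: linear regression is treated as minimizing a loss function by gradient descent, and an optional penalty on the squared weights (often called ridge or L2 regularization) encourages small weights. This is only an illustrative sketch with made-up data and a hypothetical function name, not the session's notebook code, and it does not cover the sparsity-inducing (L1) penalty.

```python
def ridge_gradient_descent(X, y, lam=0.0, lr=0.1, steps=1000):
    """Minimize mean squared error + lam * sum(w_j ** 2) by gradient descent.

    No intercept term is fitted, to keep the sketch short.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Residual for each observation under the current weights
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the data-fit term plus gradient of the penalty term
        grad = [
            (2.0 / n) * sum(r * row[j] for r, row in zip(residuals, X))
            + 2.0 * lam * w[j]
            for j in range(d)
        ]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Made-up data generated exactly by y = 2 * x1 + 1 * x2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [2.0, 1.0, 3.0, 5.0, 4.0]

w_plain = ridge_gradient_descent(X, y, lam=0.0)  # unregularized fit
w_ridge = ridge_gradient_descent(X, y, lam=1.0)  # penalized fit
print(w_plain, w_ridge)
```

With lam=0.0 the weights recover the generating values and the loss reaches zero; with lam=1.0 the weights are pulled toward zero, trading a little fit on this data for smaller weights.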
Preparation: If you haven’t already downloaded Anaconda or another Python distribution, please do so.
Overview: Please check out the GitHub repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop
Pattern Recognition & Varieties of Machine Learning
Session 5: Pattern recognition and varieties of machine learning with Ashley Hou
Owen and Ashley will be cofacilitating this session.
This session will introduce basic concepts in machine learning. We will first give an overview of the steps involved in the machine learning process and the two main categories of machine learning problems. Then, we will walk through examples in both supervised and unsupervised learning, specifically classification using SVMs (discussing the regularization perspective) and clustering using the k-means clustering algorithm. We will conclude with a brief discussion of other popular machine learning algorithms, when to use them, and good resources to learn more.
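For a feel of the unsupervised example, here is a minimal k-means sketch in plain Python (the supervised SVM example is not shown). The points and the function name k_means are invented for illustration; the session itself uses scikit-learn, whose implementation is more careful about initialization and convergence.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance from p to every center
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]  # keep the old center if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Two made-up, well-separated groups of 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```

No labels are supplied: the algorithm groups the points using only the distances between them, which is what makes this an unsupervised method.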
Preparation for the workshop: 1. Review session 4’s overview. 2. Have a working Python 3 distribution with scikit-learn, matplotlib, numpy, pandas, and Jupyter Notebook installed.
Cross-Validation
Session 6: Cross-validation with Sarah Sant’Ana
Today’s session will introduce the concept of cross validation. Using instructional videos from the DataCamp Machine Learning Toolbox, we will walk through basic examples of cross validation in R using the caret package. We will be using two publicly available data sets in R for example code.
Our goals for this session are:
Preparation for the workshop:
– Be ready to use R and RStudio
– Read Yarkoni & Westfall (2017) through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100-1122.
Neural Networks
Session 7: Neural Networks with Ray Doudlah
Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, computational models loosely inspired by the way the brain processes information.
With recent advancements in computing power and the explosion of big data, we can now implement large models that perform end-to-end learning (deep learning). This means that we can create a model, feed it very large amounts of data, and the model will learn the features of the data that are important for accomplishing the task.
Session outline:
• Introduce the simplest neural network, the perceptron
• Discuss the general architecture for neural networks
• Implement a neural network to solve a handwriting recognition task
• Introduce deep learning (convolutional neural networks)
• Implement a deep neural network to solve a handwriting recognition task
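As a small preview of the first item in the outline, here is a sketch of a perceptron trained with the classic perceptron learning rule, in plain Python. It learns the logical AND function rather than handwriting, and the function names are hypothetical; the session materials on GitHub may be organized quite differently.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron rule: adjust weights only when an example is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # 0 when correct, +1 or -1 when wrong
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, x):
    """Output 1 if the weighted sum of the inputs exceeds the threshold."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Logical AND: the output is 1 only when both inputs are 1
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]

weights, bias = train_perceptron(inputs, targets)
predictions = [predict(weights, bias, x) for x in inputs]
print(weights, bias, predictions)
```

A single perceptron can only separate classes with a straight line (or plane), which is one motivation for stacking many such units into the larger networks covered later in the outline.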
Preparation for the workshop:
 Watch the following videos:
 Pull session 7 materials from GitHub
Bayesian Inference
Session 8: Bayesian Inference: estimating unobservable variables with Lowell Thompson
This session will introduce a common statistical method known as Bayesian inference. We’ll focus first on Bayes’ theorem and learn how it relates to our understanding of perception as an inverse problem. Since the majority of research in perception relies on various psychophysical methodologies to assess behavior, we’ll also walk through how you might generate your own experiments in Python using a package called PsychoPy. After obtaining some data, we’ll look at a specific example that illustrates the utility of Bayesian inference in modeling our own behavioral data. Lastly, we’ll go over Bayesian inference in the broader context of data science.
Session Outline:
 Introduce Bayes’ theorem
 Understand the utility of Bayesian inference in a variety of contexts
 Learn the basics of PsychoPy to create basic experiments
 Use your own data from an orientation discrimination task to illustrate how Bayesian inference can be used.
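To illustrate the first outline item, here is a small sketch of Bayes' theorem applied over a grid of candidate orientations, in plain Python. All numbers are assumptions made up for the example (a Gaussian prior centered on vertical, Gaussian measurement noise, a single noisy measurement of 6 degrees), and the function names are hypothetical; this is not the model or data used in the session.

```python
import math

def gaussian(x, mean, sd):
    """Probability density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def normalize_posterior(prior, likelihood):
    """Bayes' theorem on a grid: posterior is proportional to prior * likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # normalizing constant
    return [u / evidence for u in unnormalized]

# Candidate stimulus tilts, in degrees from vertical
orientations = [float(d) for d in range(-20, 21)]

# Assumed prior: tilts near vertical (0 degrees) are more likely
prior = [gaussian(theta, 0.0, 5.0) for theta in orientations]

# Assumed noisy sensory measurement and measurement noise
measurement = 6.0
likelihood = [gaussian(measurement, theta, 5.0) for theta in orientations]

posterior = normalize_posterior(prior, likelihood)
best = orientations[posterior.index(max(posterior))]
print(best)
```

Because the prior and the measurement noise are equally broad in this sketch, the most probable orientation falls halfway between the prior's center and the measurement, showing how prior expectations pull an estimate away from the raw sensory data.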
Preparation: Please try to install PsychoPy on your computer prior to the session, and try running one of their tutorials to make sure it works: PsychoPy