3 Final Project Assignment

3.1 Overview

This section goes over what’s expected for the final project, Fall 2020.

General Note: Please note that this sheet cannot possibly cover all the “do’s and don’ts” of data analysis and visualization. You are expected to follow all of the best practices discussed in class throughout the semester.

3.2 General info

3.2.1 Goal

The goal of this project is to perform an exploratory data analysis / create visualizations with data of your choosing in order to gain preliminary insights on questions of interest to you.

3.2.2 Teams

For this assignment you will work in teams of 2-4 people. If you wish to select your own partner, please do so by the date specified in CourseWorks. To indicate who you are working with, sign up in the People section of CourseWorks. Do not click the +Group button; rather, drag your names into one of the groups already created with the name “Final Project ”. (If you don’t follow these instructions and create your own group, it will not be linked to the Final Project assignment and therefore you won’t be able to submit your project properly as a team.) Please join the groups in order starting with group #1 so that we don’t end up with gaps in the group numbers.

Piazza is a good place to post your interests and look for partners. After the date indicated, partners will be randomly assigned. (This is actually a better option in terms of preparing yourself for a work environment with colleagues that you don’t know well.) If you wish to be assigned to a partner before that date so you can get started earlier (recommended) please email me.

Once the groups are set up, we will ask for a short description of your project ideas, so start planning!

3.2.3 Topics

The topic you choose is open-ended… choose something that you are intereted in and genuinely curious about! Think of some questions that you don’t know the answer to. Next look for data that might help you answer those questions.

3.2.4 Data

The data can be pulled from multiple sources; it does not need to be a single dataset. Be sure to get data from the original source. For example, if you wish to work with data collected and distributed by the Centers for Disease Control, that is where you should go to access the data, not a third party that has posted the data. Do not use datasets that have been processed and cannot be traced to the source. For this reason, datasets from Kaggle are not good choices unless you can find the original source of the data.

3.3 Report format

Your report should be submitted as a bookdown book, which you can create from the final project template. Be sure to follow the instructions provided in the repo README. (There are other templates in my account. Make sure you use this one.)

Your report will include the following chapters, as shown in this rendered version of the final project template. (The links to the GitHub repo won’t work since the placeholders in the

Note that each chapter is an .Rmd file in the top level directory of the GitHub repo. You can have additional folders in your repo that will not be part of the bookdown book, for example, pre-processing scripts, experimental work, etc. However, do not include additional .Rmd files in the top level directory because they either will inadvertently end up in the bookdown book, or cause errors / warnings.

To turn in your final project, just submit the link to the rendered book (such as “https://jtr13.github.io/EDAVtemplate/” to the CourseWorks final project assignment.) After the deadline, do not make any changes to the main branch of your GitHub repo.

I. Introduction

Explain why you chose this topic, and the questions you are interested in studying. Provide context for readers who are not familiar with the topic.

(suggested: approximately 1/2 page)

II. Data sources

Describe the data sources: who is responsible for collecting the data? How is it collected? If there were a choice of options, explain how you chose.

Provide some basic information about the dataset: types of variables, number of records, etc.

Describe any issues / problems with the data, either known or that you discover.

(suggested: approximately 1 page)

III. Data transformation

Describe the process of getting the data into a form in which you could work with it in R. If your code does not lend itself to being including in the .Rmd file, provide a link to the folder or file(s) that contain(s) that code.

(suggested: approximately 1/2 page)

IV. Missing values

Describe any patterns you discover in missing values.

(suggested: 2 graphs plus commentary)

V. Results

You have a lot of freedom to choose what to do, as long as you restrict yourselves to exploratory techniques (rather than modeling / prediction approaches). In addition, your analysis must be clearly documented and reproducible.

Provide a short nontechnical summary of the most revealing findings of your analysis written for a nontechnical audience. Take extra care to clean up your graphs, ensuring that best practices for presentation are followed, as described in the audience ready style section below.

Use subheadings as appropriate. See Todd Schneider’s blog posts for examples of thoughtful, informative subheadings.

(suggested: 5-10 graphs plus commentary, a plot with multiple facets counts as 1 graph)

VI. Interactive or animated component

Select one (or more) of your key findings to present in an interactive or animated format using D3 version 6. Be thoughtful about how you use interactivity and/or animation: using these features to add value in a way that would not be possible with a static graph.

Design with the goal of users walking away with a new understanding of the data and a clear understanding of the purpose of the visualization. If the tool is interactive, provide clear instructions on how the user should engage. Thoughtfulness is more important than technical prowess; in all aspects of the project, think quality, not quantity.

Interactive graphs must follow all of the best practices as with static graphs in terms of perception, labeling, accuracy, etc.

Please see this chapter on sharing your D3 code online for options for sharing the interactive portion of your project. You are encouraged to share experiences on Piazza to help classmates with the publishing process.

Note: the interactive component is worth approximately 25% of the final project grade. Therefore, do not spend 90% of your time on it… concentrate on the exploratory data analysis piece.

VII. Conclusion

Discuss limitations and future directions, lessons learned.

(suggested: approximately 1/2 page)

3.4 Reproducible workflow

It is important to think about a reproducible workflow from the very start. Everything you do should be guided by the question: Could others repeat what you’ve done based on the materials and explanations you’ve provided?

In practical terms, this means:

  • Use git/GitHub for version control, collaboration, and sharing code. All team members should have write access (i.e. be collaborators) on the same repo, either belonging to one team member or part of a GitHub organization that you create. With write access, think of the repo as “yours” and follow the “Your repo with branching” workflow.

  • In general, keep code, output, and analysis together in the .Rmd chapter files. If there are pieces of your project that can’t be included in the .Rmd, then document why this is the case and provide clear steps on how your process can be reproduced.

  • Use scripts for importing, cleaning and transforming data. As above, if it is not feasible or practical to do so, explain why and clearly document your workflow.

  • Document as you go. Copy and paste helpful links, Stack Overflow posts, whatever you use into comments in the places you use them. You will be happy you did so.

Recommended reading: Ten Simple Rules for Reproducible Computational Research. (Not all rules are applicable to this assignment but it will help you think about and develop a solid personal workflow plan going forward.)

3.5 Code

All of your code should be stored on GitHub. We will spend some time in class learning how to set up a Git/GitHub workflow. All group members should have write access to the same repo. You will need to discuss your work plan to avoid merge conflicts.

Keep your project organized. Create folders that work for you, such as data/raw, data/clean, preprocess, etc.

The static visualizations should be done in R, but other pieces, such as data importation and cleaning do not have to be.

3.6 Feedback

At any point, you may ask the TAs or the instructor (Joyce) for advice. Our primary role in this regard will be to provide general guidance on your choice of data / topic / direction. As always, you are encouraged to post specific questions to Piazza, particularly coding questions and issues. You may also volunteer to discuss your project with the class in order to get feedback–if you’d like to do this, email the instructor to schedule a date.

3.7 Peer review

After final projects are turned in, you will be asked write peer reviews of other projects. Each individual will be assigned two project groups to review, and instructions will be provided.

Note: part of the grade you receive for the class is based on the quality of review that you write, not on the feedback that your project receives. Your grade for the project (as for all other assignments for the class) will be determined solely by the instructor and TAs.

3.8 Writing advice

3.8.1 Start early

Don’t wait to start writing. Your overall project will undoubtedly be better if you give up trying to get that last graph perfect or the last bit of analysis done and get to the writing.

3.8.2 Tell it like it is

Be as intellectually honest as possible. That means pointing out flaws in your work, detailing obstacles, disagreements, decision points, etc. – the kinds of “behind-the-scene” things that are important but often left out of reports.

3.8.3 Hide code

The bookdown template is set up to do this for you. It includes an _common.R file, referenced in the _bookdown.yml file, which sets knitr chunk options for all chapters.

Since we will not view the code in the bookdown book, be sure that the View and Edit GitHub link buttons work properly.

3.8.4 Interpret your graphs

All graphs should be accompanied by textual description / interpretation.

3.8.5 Use simple formatting

Using RMarkdown makes it easy to combine code, text and graphs into a reproducible workflow. Fancy formatting is not its strength. Therefore, focus on the text and graphs, not the formatting. If you’re not sure if something is important to focus on or not, please ask.

3.9 Audience ready style

As we’ve discussed throughout the semester, standards are higher for clarity in graphs designed to be shared with others. While “good enough” is our standard for EDA, we need to go the extra mile for presentation. The following is checklist of items to address to make your graphs presentation ready. (You do not have to worry about these items for the EDA section.)

  • Title, axis labels, tick mark labels, and legends should be comprehensible (easy to understand) and legible (easy to read / decipher).
  • Tick marks should not be labeled in scientific notation or with long strings of zeros, such as 3000000000. Instead, convert to smaller numbers and change the units: 3000000000 becomes “3” and the axis label “billions of views”.
  • Units should be intuitive (An axis labeled in month/day/year format is intuitive; one labeled in seconds since January 1, 1970 is not.)
  • The font size should be large enough to read clearly. The default in ggplot2 is generally too small. You can easily change it by passing the base font size to the theme, such as + theme_grey(16) (The default base font size is 11).
  • The order of items on the axes and legends should be logical. (Alphabetical is usually not the best option.)
  • Colors should be color-vision-deficiency-friendly.
  • If categorical variable levels are long, set up the graph so the categorical variable is on the y-axis and the names are horizontal. A better option, if possible, is to shorten the names of the levels.
  • Err on the side of simplicity. Don’t, for example, overuse color when it’s not necessary. Ask yourself: does color make this graph any clearer? If it doesn’t, leave it out.
  • Test your graphs on nontechnical friends and family and ask for feedback.

Above all, have fun with it

3.10 Grading

We grade holistically, which means that we consider the project as a whole, we do not follow a rubric with dozens of items each worth a few points. Each project is different and this gives you the flexibility to devote more or less to various aspects of the project depending on what’s appropriate, within reason.

We are more impressed by quality than quantity.

In determining grades, we take the following into account:

  • Originality Are your questions thought-provoking? Do they encourage the reader to think about the topic in a new way?

  • Real world context Do your graphs and textual descriptions reflect a solid understanding of what your data mean? Is it clear why you are asking the questions that you are asking? Are your interpretations reasonable?

  • Reproducibility Did you provide all of your code in a manner that will be easy for the reader to rerun your analysis, and include an explanation for any pieces that cannot be reproduced? Is your code clear?

  • Multidimensionality Do you examine multidimensional relationships and present them clearly?

  • Choice of graph forms Are your graph forms good choices for your data?

  • Parameters / design decisions Have you made good choices in parameters, color, etc.?

  • Standards Do your graphs meet audience ready style standards?

  • Interactive part Are the instructions clear? How well does the interactive component connect with the goals of the project? Does it help the reader understand the main conclusions?

  • Technical competence Do your links work? Are your graphs displaying properly?

3.11 Resources

“Tidy Tuesday Screencast: analyzing college major & income data in R” David Robinson explores a dataset in R live, without looking at the data in advance. This may be helpful in figuring out how to get started.