---
title: "Milestone 1"
descriptions: |
  GitHub repository, statistical question, data & rough draft of analysis in one monolithic literate code document, reproducible virtual environment
output:
  distill::distill_article:
    toc: true
---
## Overall project summary
In this course you will work in assigned teams of three or four
(see group assignments in [Canvas](https://canvas.ubc.ca))
to answer a predictive question using a publicly available data set that will allow you to answer that question.
To answer this question,
you will perform a complete data analysis in R and/or Python,
from data import to communication of results,
while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook,
to a fully reproducible and robust data analysis project, comprising:
- a well documented and modularized software package and scripts written in R and/or Python,
- a data analysis pipeline automated with GNU Make,
- a reproducible report powered by Quarto,
- a containerized computational environment created and made shareable by Docker, and
- remote version control repositories on GitHub for project collaboration and sharing,
as well as automation of test suite execution and documentation and software deployment.
An example final project from another course (where the project is similar) can be seen here:
[Breast Cancer Predictor](https://github.com/ttimbers/breast_cancer_predictor_py)
## Milestone 1 summary
In this milestone, you will:
1. Draft a team work contract
2. Set up a public GitHub repository
3. Create an appropriate file and directory structure for a data analysis project
4. Perform the data analysis in a literate code document
5. Ensure the computation environment is reproducible through a virtual environment (e.g., conda or renv). Hint: do this as you create the analysis; don’t wait until the end!
6. Manage issues professionally
An example project milestone 1 is available here:
[Breast Cancer Predictor in Python v0.0.4](https://github.com/ttimbers/breast_cancer_predictor_py/tree/v0.0.4)
### 1. Draft a team work contract
This document will govern the working relationships on your team, and if done correctly,
it should help you manage and resolve any issues that might arise when working in a group.
A sample team work contract from a past data science project can be found [here](sample-team-work-contract.html).
A team work contract communicates specifically **how** the core group of people will work together,
and gives detail about the logistics of working together and the expectations you have for each other.
Some aspects of the team work contract could be:
- How will work be distributed in a fair and equitable way?
- What are the expected work hours for the project?
- How often will group meetings occur ([here is a nice article](http://third-bit.com/2018/05/11/meetings.html) on meetings)?
- Will you have meeting agendas and minutes? If so, what will be the system for rotating through these responsibilities?
- What will be the style of working?
- Will you start each day with [stand-ups](https://www.atlassian.com/agile/scrum/standups), submit a summary of your contributions 4 hours before each meeting, or do something else?
- What is the quality of work each team member expects from themselves and each other?
- When are team members not available (e.g., evenings and Sundays because of family obligations)?
- And any other similar things that govern your working relationships.
Use this opportunity to use your prior knowledge/experience to improve your
teamwork, communication, leadership, and organizational skills.
You will need all of these for your capstone projects (and beyond)!
**Note - this document is fairly personal and does NOT need to reside in your public GitHub.com repo. Instead you can prove that you created this by submitting a copy of it to Gradescope under the Milestone 1 assignment.**
#### Activity to start drafting the team work document
1. (5 min) Each person quietly writes down three things that are important to them
for the team work contract.
2. (5 min) Each person takes a turn sharing their three ideas with the group.
3. (10 min) Group discussion to decide which of the ideas presented should
be incorporated into the contract
(try to fairly distribute the inclusion of ideas from all teammates).
Also discuss if there is anything missing from the contract.
### 2. Set up a GitHub repository
Fill out this survey to provide us with your GitHub.com username: https://ubc.ca1.qualtrics.com/jfe/form/SV_8cCRmyxpbch5uxU
After we add you to the [DSCI 310 2024 organization on GitHub.com](https://github.com/DSCI-310-2024), set up a public GitHub repository for your group under that organization, being sure to add all team members as collaborators. Once you have decided on a project question and data set, make sure your repository has a name that is relevant to that question. *Note: if you need to change the name of your project repository later, you can!*
We are using github.com and public repositories because we will eventually be using some fancy tooling (*e.g.,* GitHub Actions) in this project. For this to work, all your work for this project (including scripts) should be developed and live in this repository on GitHub.com. Additionally, this will help you build a portfolio of your work you can share with others.
### 3. Create an appropriate file & directory structure for a data analysis project
For this project, you are expected to follow the best practices we have learned about with regard to project directory and file structures. This should be reflected in your GitHub repository. Specifically for this milestone, please ensure that you include **at least** the following 7 files and directories in your project's GitHub repository:
- `README.md`
- `CODE_OF_CONDUCT.md`
- `CONTRIBUTING.md`
- `data/`
- `<analysis>.ipynb` (give it a meaningful name!)
- `environment.yml` (or the corresponding `renv` files)
- `LICENSE.md`
The description of each of them follows.
#### i. `README.md`
In the main `README.md` file for this project you should include:

- the project title
- the list of contributors/authors
- a short summary of the project (view from 10,000 feet)
- how to run your data analysis
- a list of the dependencies needed to run your analysis
- the names of the licenses contained in `LICENSE.md`

**Note - this document should live in the root of your public GitHub.com repo.**
#### ii. `CODE_OF_CONDUCT.md`
In an attempt to create a safe, positive, productive, and happy community, many organizations and developers create a code of conduct for their projects. A code of conduct in a Data Science project informs others of social norms, acceptable behaviour and general etiquette. It is more outward facing than the team work contract discussed above. We recommend [Project Include](https://projectinclude.org/writing_cocs) for a comprehensive guide to writing an effective Code of Conduct.
At minimum, we believe that a good/effective code of conduct should include:
- A statement on diversity and inclusivity
- Details on expected and unacceptable behaviour
- A list of consequences for unacceptable behaviour
- A procedure for reporting and dealing with unacceptable behaviour
##### Sample Codes of Conduct:
- [UBC Data Science 100 CoC](https://github.com/UBC-DSCI/dsci-100/blob/master/CODE_OF_CONDUCT.md)
- [idocde CoC](http://www.idocde.net/pages/35)
- [Python Community CoC](https://www.python.org/psf/conduct/)
- [Tidyverse CoC](https://github.com/tidyverse/tidyverse.org/blob/master/CODE_OF_CONDUCT.md)
- [Pandas CoC](https://github.com/pandas-dev/pandas-governance/blob/master/code-of-conduct.md)
- [Vox Media CoC](http://code-of-conduct.voxmedia.com/?_ga=1.62865454.308680892.1455143920)
**Note 1 - this document should live in the root of your public GitHub.com repository.**
**Note 2 - If you use all, or even part, of someone else's code of conduct, be sure to make this clear by attributing them.**
#### iii. `CONTRIBUTING.md`
It is a good practice to include information about how others, outside the core team, can contribute to your project somewhere in your repository. This is typically done as a separate file in the repository called `CONTRIBUTING.md`. Here are some examples of this file:
- [Altair](https://github.com/altair-viz/altair/blob/master/CONTRIBUTING.md)
- [dplyr](https://github.com/tidyverse/dplyr/blob/master/.github/CONTRIBUTING.md)
- [Atom editor](https://github.com/atom/atom/blob/master/CONTRIBUTING.md) (very comprehensive)
**Note 1 - this document should live in the root of your public GitHub.com repository.**
**Note 2 - If you use all, or even part, of someone else's contributing document, be sure to make this clear by attributing them.**
#### iv. `data/`
The `data` directory should contain a copy of the downloaded data (it should get written there by code). An exception to this can be made if the data set is extremely large (\> 100 MB per file). In such cases, please reach out to the instructional team to make an alternative plan.
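As a minimal sketch of what "written there by code" could look like in Python, the function below downloads a file into `data/` using only the standard library. The function name `download_data`, its parameters, and the example URL in the comment are hypothetical placeholders, not part of the course materials; an equivalent approach in R works just as well.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_data(url, dest_dir="data", filename=None):
    """Download `url` into `dest_dir` (created if needed) and return the local path.

    If the target file already exists, it is left untouched and not downloaded again.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Default to the last path component of the URL as the file name.
    target = dest / (filename or url.rstrip("/").split("/")[-1])
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Hypothetical usage (replace the placeholder URL with your data source):
# raw_path = download_data("https://example.com/my_dataset.csv")
```

Skipping the download when the file already exists keeps repeated runs of the analysis fast, at the cost of not noticing when the upstream file changes.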
#### v. `analysis.ipynb` / `analysis.qmd`
For this milestone, the analysis code and narration should be contained within a single literate code document (e.g., Jupyter notebook, RMarkdown file, or Quarto document). This document should not actually be named `analysis.***`, but have a more descriptive title, related to the project title.
#### vi. `environment.yml` (or the corresponding `renv` files)
Include the necessary files (e.g., `environment.yml`) so others can reproduce your work on their machine.
**Note - this document should live in the root of your public GitHub.com repository.**
#### vii. `LICENSE.md`
Licenses tell others how they may (or may not) use your work. We will learn more about licenses later in the course, and for now we recommend using an [MIT license](https://opensource.org/licenses/MIT) for the project code and a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)](https://creativecommons.org/licenses/by-nc-nd/4.0/) license for the project report. As long as your group agrees, these can be changed later in the course when you learn more about licenses. If you want to spend some time choosing different licenses now, we can recommend these two websites to help get you started:
- https://choosealicense.com/
- https://chooser-beta.creativecommons.org/
**Note - this document should live in the root of your public GitHub.com repository.**
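Before submitting, it can help to check the repository against the seven required files and directories listed in this section. The sketch below is one possible way to do that in Python. It assumes the analysis document sits in the repository root, and it treats `renv.lock` as the `renv` counterpart of `environment.yml`; the function and constant names are hypothetical.

```python
from pathlib import Path

# Required entries for milestone 1, taken from the list in this section.
REQUIRED_FILES = ["README.md", "CODE_OF_CONDUCT.md", "CONTRIBUTING.md", "LICENSE.md"]
REQUIRED_DIRS = ["data"]
# The analysis document has a project-specific name, so match it by extension.
ANALYSIS_SUFFIXES = {".ipynb", ".Rmd", ".qmd"}
# Assumption: either a conda environment file or an renv lockfile is acceptable.
ENV_FILES = ["environment.yml", "renv.lock"]


def missing_entries(repo_root):
    """Return the required milestone 1 entries that are missing from `repo_root`."""
    root = Path(repo_root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    if not any(p.suffix in ANALYSIS_SUFFIXES for p in root.iterdir() if p.is_file()):
        missing.append("<analysis> literate code document")
    if not any((root / name).is_file() for name in ENV_FILES):
        missing.append("environment.yml (or renv files)")
    return missing


if __name__ == "__main__":
    problems = missing_entries(".")
    print("Missing: " + ", ".join(problems) if problems else "All required entries found.")
```

Run it from the repository root. An empty result only means the entries exist; it says nothing about whether their contents meet the requirements described above.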
### 4. Perform the data analysis in a literate code document
The content of this analysis project should be a narrated analysis that asks
and answers a predictive question using a classification
or regression method taught in the prerequisite course, DSCI 100.
For milestone 1,
the code and analysis narrative should be contained within a single Jupyter notebook
(e.g., `.ipynb` file), RMarkdown file (e.g., `.Rmd` file), or Quarto document (e.g., `.qmd` file).
The analysis narrative should be rich,
and at the level of the final project from DSCI 100.
Either R or Python can be used to do this.
The data for the project should be publicly available,
and clearly licensed to be shared and used openly on the internet.
We strongly suggest you avoid using data sets where authentication is needed for access
(e.g., Kaggle) as this adds another layer of complexity when making these projects reproducible.
**Note - You are expected to create an original data analysis for this project.**
**You are not allowed to reuse an analysis from another course.**
#### Possible data set sources
- [City of Vancouver Open Data Portal](https://opendata.vancouver.ca/pages/home/)
- [BC Data Catalogue](https://catalogue.data.gov.bc.ca/)
- [Government of Canada open data](https://open.canada.ca/en/open-data)
- [GitHub repo of awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)
- [UCI Machine Learning Repository: Data Sets](https://archive-beta.ics.uci.edu/ml/datasets)
- [tidytuesday data set](https://github.com/rfordatascience/tidytuesday)
- [Google Data set search engine](https://toolbox.google.com/datasetsearch)
- [fivethirtyeight R package](https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html) (if you use this, see the note below!)
**Note: if you choose a data set from the fivethirtyeight R package, you cannot copy the scientific question, visualizations or methods from the original FiveThirtyEight articles in which the data sets were first reported. Also, with the fivethirtyeight R package data sets, we want you to get practice reading them using the `read_*` functions in R or Python, so please use the versions of the data sets listed here: https://github.com/rudeboybert/fivethirtyeight/tree/master/data-raw**
Your analysis should have the following sections:
- **Title**
- **Summary**
- **Introduction:**
    - provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
    - clearly state the question you tried to answer with your project
    - identify and describe the dataset that was used to answer the question
- **Methods & Results:**
    - describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis
    - your report should include code which:
        - loads data from the original source on the web
        - wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned classification or regression analysis
        - performs a summary of the data set that is relevant for exploratory data analysis related to the planned classification or regression analysis
        - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned classification or regression analysis
        - performs classification or regression analysis
        - creates a visualization of the result of the analysis
    - *note: all tables and figures should have a figure/table number and a legend*
- **Discussion:**
    - summarize what you found
    - discuss whether this is what you expected to find
    - discuss what impact such findings could have
    - discuss what future questions this could lead to
- **References:**
    - at least 4 citations relevant to the project (the format is your choice, just be consistent across the references)
### 5. Ensure the computation environment is reproducible through a virtual environment (e.g., conda or renv)
For this project, we will eventually be making the computation environment reproducible through containerization with Docker. For milestone 1, however, we will start by using a virtual environment as our first step towards this. You can use either `conda` or `renv` for this. Be sure to include the necessary files (e.g., `environment.yml`) so others can reproduce your work. Also document how to use the environment with your project in the `README.md` file.
Finally, make sure the versions of all software and software packages used in the analysis
are recorded in the file that specifies the environment.
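One way to catch dependencies with no recorded version is to scan `environment.yml` for entries without any version specifier. The sketch below does this with plain string handling rather than a YAML parser, so it assumes the common conda layout of a top-level `dependencies:` list with an optional nested `pip:` list. The function name and the sample environment are illustrative only.

```python
def unpinned_dependencies(env_text):
    """Return dependency entries in an environment.yml text that have no version specifier."""
    unpinned = []
    in_dependencies = False
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not line[0].isspace() and not stripped.startswith("-"):
            # A new top-level key such as "name:", "channels:" or "dependencies:".
            in_dependencies = stripped.startswith("dependencies:")
            continue
        if in_dependencies and stripped.startswith("- "):
            entry = stripped[2:].split("#")[0].strip()
            if entry.endswith(":"):
                # Header of the nested "pip:" list, not a package.
                continue
            if "=" not in entry:
                unpinned.append(entry)
    return unpinned


# Illustrative environment file; package names and versions are examples only.
EXAMPLE_ENV = """\
name: milestone1
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas
  - pip:
    - altair==5.2.0
    - vegafusion
"""

if __name__ == "__main__":
    print(unpinned_dependencies(EXAMPLE_ENV))  # prints ['pandas', 'vegafusion']
```

This check only flags entries with no `=` at all. A lower bound such as `pandas>=2.0` passes, even though it does not record an exact version.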
### 6. Manage issues professionally
Manage issues effectively through project boards and milestones, making it clear who is responsible for what and which project milestone each task is associated with. In particular, create an issue for each task in the project. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project and **each team member should be responsible for an approximately equal portion of the code.**
To ensure the TAs can grade your project board, change the visibility from the default private setting to the public setting. To change the visibility of the project board do the following:
1. Go to the project board and click on the three vertical dots on the right-hand side of the board and select settings
2. Scroll to the bottom of the Settings page to the "Danger zone" and under visibility select "Public".
## Submission Instructions
You will submit two URLs to Canvas in the provided text box for milestone 1:
1. the URL of your project's GitHub.com repository
2. the URL of a GitHub release of your project's GitHub.com repository
### Creating a release on GitHub.com
Just before you submit milestone 1,
create a release on your project repository on GitHub and name it
`0.0.1` ([how to create a release](https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release)).
This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes,
while you continue to work on your project for the next milestone.
## Expectations
- **Everyone should contribute equally to all aspects of the project**
**(e.g., code, writing, project management).**
**This should be evidenced by a roughly equal number of commits,**
**pull request reviews and participation in communication via GitHub issues.**
- After the repository is set up,
each group member should work in a
[GitHub flow workflow](https://docs.github.com/en/get-started/quickstart/github-flow),
where they create branches for each feature or fix,
which are reviewed and critiqued by at least one other teammate
before the pull request is accepted.
- You should be committing to git and pushing to GitHub.com every time you work on this project.
- Git commit messages should be meaningful. These will be marked.
It's OK if one or two are less meaningful, but most should be.
- Use GitHub issues to communicate with your teammates (as opposed to email, WhatsApp, WeChat, etc).
- Your question, analysis and visualization should make sense. It doesn't have to be complicated.
- You should use proper grammar and full sentences in your written documents.
Point form may occur, but should be less than 10% of your written documents.