How to set up, configure, and work with git and GitHub in the practice of data science.
Git is a distributed version control system that tracks changes in any set of computer files. It is most often used to coordinate work among programmers who are collaboratively developing source code. Its goals include speed, data integrity, and support for distributed, non-linear workflows. (Source: Wikipedia)
GitHub is a developer platform that allows developers to create, store, manage, and share their code. It is built on Git and adds access control, bug tracking, feature requests, task management, continuous integration, and wikis for every project on top of Git's distributed version control. As of January 2023, it reported hosting work by approximately 100 million developers. (Source: Wikipedia)
Data aggregation, cleaning, pipelines, and ML models all rely on software to operate. Responsible software management depends on well-managed code, versioning, and the prioritization of bugs, features, and user issues. Further, modern platforms and infrastructure tend to favor code-driven tests, builds, deployment, and management.
All of which is to say: code is fundamental to our work, and working without source control would be both risky and impractical.
- Setup
  - Install and set up `git`
  - Authenticate `git` to GitHub
  - Basic configuration
  - Troubleshooting
- Creating and managing a repository
  - Create a repository locally
  - Create a repository in GitHub
  - Adding or removing collaborators
- Source control basics
  - Diff
  - Status
  - Add
  - Commit
  - Push/Pull
  - Fetch
  - Log
- Branches, Forks, and Merges
  - Branches
  - Forks
  - Fetch from Upstream
  - Merges and Pull Requests
  - Issues
- Advanced Git/GitHub Features
  - Stash
  - Signing commits
  - Reset and Revert
  - Rebase
  - Cherry-pick
  - Renaming `origin`
- Bonus
  - GitHub Actions
    - About
    - Credentials & Secrets
    - Example 1 - Build software upon a push
    - Example 2 - Build and deploy a container