Version Control: GitHub

What is Version Control?

Version control is a system that helps you track changes in your code or data over time. It lets you:

  • Save different versions of your project.
  • Collaborate with others without overwriting each other’s work.
  • Revert to earlier states when something breaks.
  • Understand the evolution of your project through clear history.


The most widely used version control system today is Git, and the most popular platform for hosting Git repositories is GitHub.


What is Git?

Git is a free and open-source version control system created by Linus Torvalds (the creator of Linux). Git runs locally on your machine and helps manage your project's history.

Git works by recording snapshots of your files (called commits), allowing you to move between different versions or branches of your code.
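
To make the snapshot idea concrete, here is a minimal sketch that records two commits in a throwaway directory and then reads back the older version of a file. The file name analysis.txt and the identity values are placeholders chosen for illustration, not part of any real project.

```shell
set -e
# Work in a temporary directory so nothing on your machine is touched.
demo="$(mktemp -d)"
cd "$demo"
git init -q
# Identity for this repository only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot.
echo "first draft" > analysis.txt
git add analysis.txt
git commit -q -m "Add first draft"

# Second snapshot: -a stages the change to the already tracked file.
echo "second draft" > analysis.txt
git commit -q -am "Revise draft"

# One line per commit, newest first.
git log --oneline

# Read the file as it was one commit ago without changing the working copy.
old_version="$(git show HEAD~1:analysis.txt)"
echo "$old_version"
```

The working copy still holds the latest text, while every earlier snapshot stays available through the history.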


What is GitHub?

GitHub is a cloud-based platform built around Git. It allows you to:

  • Store your Git repositories online
  • Collaborate with other developers
  • Use powerful tools like pull requests, issues, wikis, and CI/CD workflows

You can think of Git as the engine, and GitHub as the car that helps you drive and manage it with others on the road.


Why Data Scientists Should Learn Git & GitHub

  • Collaborate with teammates (engineers, analysts, or fellow data scientists)
  • Keep track of experiments and model versions
  • Share notebooks or datasets
  • Make your work reproducible
  • Contribute to open-source data science projects

Core Concepts and Terminology

Term               Description
Repository (repo)  A folder or project tracked by Git
Commit             A snapshot of your changes
Branch             A parallel version of your codebase
Merge              Combining changes from one branch into another
Clone              Downloading a GitHub repository to your local machine
Push/Pull          Sending or receiving changes between local and remote
Fork               Creating a copy of someone else's repository (on GitHub)

Getting Started with Git and GitHub

Step 1: Install Git

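Download Git from https://git-scm.com/downloads (or install it with your system's package manager), then verify the installation: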

git --version

Step 2: Set Up Git

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
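
The --global flag applies these settings to every repository on your machine. If one project needs a different identity (for example, a work email), running the same command without --global inside that repository overrides the global value there. A minimal sketch, using placeholder names and a temporary repository:

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
# No --global flag: stored in this repository's .git/config only.
git config user.name "Work Name"
git config user.email "work@example.com"

# Read back the value Git will use for commits made in this repository.
effective_email="$(git config --get user.email)"
echo "$effective_email"

# --local restricts the listing to this repository's own settings.
git config --local --list | grep '^user\.'
```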

Step 3: Create a GitHub Account

Go to https://github.com and sign up.


Step 4: Create a New Repository on GitHub

  1. Click the + icon in the top-right corner → New repository
  2. Give it a name (e.g., my-first-project)
  3. Choose public or private
  4. Add a README (optional but recommended)
  5. Click Create repository

Step 5: Connect Local Project to GitHub

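Option A: Clone an Existing Repository

git clone https://github.com/username/repo-name.git
cd repo-name

Option B: Create a Local Project and Push to GitHub

Option B assumes the GitHub repository is empty. If you added a README in Step 4, use Option A instead, because the push will be rejected when the remote already contains commits your local project lacks.

mkdir my-project
cd my-project
git init
touch README.md
git add README.md
git commit -m "Initial commit"
# Name the branch "main" so the push works even if your Git defaults to "master"
git branch -M main
git remote add origin https://github.com/username/repo-name.git
git push -u origin main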
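
If you want to rehearse the push sequence without a network connection or a GitHub account, the sketch below swaps GitHub for a bare repository on disk. That substitution is an assumption made for practice only; with a real project, origin would be your https://github.com/... URL. Directory names and identity values are placeholders.

```shell
set -e
workdir="$(mktemp -d)"
# A bare repository on disk stands in for the GitHub remote.
git init -q --bare "$workdir/remote.git"

mkdir "$workdir/my-project"
cd "$workdir/my-project"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "# my-project" > README.md
git add README.md
git commit -q -m "Initial commit"
# Make sure the branch is called main, whatever this Git version's default is.
git branch -M main

git remote add origin "$workdir/remote.git"
# -u records origin/main as the upstream, so later pushes need no arguments.
git push -q -u origin main

# List the configured remote and the branches it now holds.
git remote -v
git ls-remote --heads origin
```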

Common Git Commands You’ll Use

Command                    Purpose
git init                   Start a new Git repository
git status                 Show current changes
git add filename           Stage a file for commit
git commit -m "message"    Commit staged changes
git log                    Show commit history
git push                   Send changes to GitHub
git pull                   Get latest changes from GitHub
git branch                 Show or create branches
git checkout branch-name   Switch branches
git merge branch-name      Merge another branch into current one
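
Several of these commands revolve around the staging area: only files you have added go into the next commit. The sketch below, with hypothetical file names model.py and notes.md, stages one of two new files and confirms that the commit contains only that file.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "print('training')" > model.py
echo "scratch notes" > notes.md

# Stage only model.py; notes.md stays untracked.
git add model.py
git status --short

git commit -q -m "Add model script"

# Files recorded in the commit just created.
committed="$(git ls-tree -r --name-only HEAD)"
echo "$committed"

# What is left over after the commit.
remaining="$(git status --short)"
echo "$remaining"
```

Staging files selectively keeps each commit focused on one change, which makes the history easier to read later.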

Visualizing the Workflow

[Your Code] → git add → git commit → git push → [GitHub Repository]

When working with others, a teammate's changes reach you through GitHub:

[Teammate’s Code] → git push → [GitHub Repository] → git pull → [Your Local Copy]
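
The round trip can be tried locally. In this sketch a bare repository on disk plays the part of the GitHub repository, and two clones named you and teammate play the two collaborators. All of these names are illustrative assumptions.

```shell
set -e
workdir="$(mktemp -d)"
git init -q --bare "$workdir/shared.git"
# Make main the shared repository's default branch on any Git version.
git -C "$workdir/shared.git" symbolic-ref HEAD refs/heads/main

# You: clone the empty shared repository, make the first commit, push it.
git clone -q "$workdir/shared.git" "$workdir/you"
cd "$workdir/you"
git config user.name "You"
git config user.email "you@example.com"
echo "shared notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git branch -M main
git push -q -u origin main

# Teammate: clone, add a line, push.
git clone -q "$workdir/shared.git" "$workdir/teammate"
cd "$workdir/teammate"
git config user.name "Teammate"
git config user.email "teammate@example.com"
echo "teammate's addition" >> notes.txt
git commit -q -am "Extend notes"
git push -q origin main

# You: pull the teammate's commit into your local copy.
cd "$workdir/you"
git pull -q --ff-only origin main
cat notes.txt
```

The --ff-only option makes the pull stop instead of creating a merge commit if your local branch has diverged from the remote, which is a cautious default while learning.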

Working with Branches

Branching is key for working on features or experiments without disturbing the main project.

git checkout -b feature-model-tuning
# Work and commit on this branch
git checkout main
git merge feature-model-tuning
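
To see that work on a branch leaves main untouched until the merge, here is a self-contained sketch built around the same branch name. The file params.txt and its contents are made up for illustration.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "learning_rate=0.1" > params.txt
git add params.txt
git commit -q -m "Add baseline parameters"
git branch -M main

# Experiment on a separate branch.
git checkout -q -b feature-model-tuning
echo "learning_rate=0.01" > params.txt
git commit -q -am "Tune learning rate"

# Back on main, the file still holds the baseline value.
git checkout -q main
before_merge="$(cat params.txt)"

# Merging brings the experiment into main.
git merge -q feature-model-tuning
after_merge="$(cat params.txt)"
echo "$before_merge -> $after_merge"

# The branch is fully merged, so -d deletes it without complaint.
git branch -d feature-model-tuning
```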

Best Practices for GitHub in Data Science

  • Commit often with meaningful messages
  • Use .gitignore to avoid tracking large or unnecessary files (e.g., .DS_Store, .ipynb_checkpoints, data/)
  • Don’t push raw data or secrets (like API keys) to GitHub
  • Use branches for experiments and merge only when stable
  • Include a README to explain your project
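
As a sketch of the .gitignore advice, the example below writes a small ignore file and asks Git which paths it will skip. The patterns come from the list above; the .env entry is an added assumption, since that file name is a common place to keep API keys. The data and script files are placeholders.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q

# One pattern per line; a trailing slash matches directories only.
cat > .gitignore <<'EOF'
.DS_Store
.ipynb_checkpoints/
data/
.env
EOF

mkdir data
echo "id,value" > data/raw.csv
echo "API_KEY=placeholder" > .env
echo "print('hello')" > train.py

# check-ignore prints each given path that an ignore rule matches.
ignored="$(git check-ignore data/raw.csv .env)"
echo "$ignored"

# Untracked files Git would still offer to add.
git status --short
```

Note that .gitignore only affects files that are not yet tracked. A file that was already committed stays in the history even after a matching pattern is added.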

Using GitHub with Jupyter Notebooks

GitHub can render .ipynb files directly, allowing you and collaborators to view notebooks online without running them locally.

To share notebooks:

  • Push them to a GitHub repo
  • Share the link

Learn More:

https://docs.github.com/en/get-started/using-git/about-git

https://www.youtube.com/watch?v=8JJ101D3knE

https://www.youtube.com/watch?v=mJ-qvsxPHpY

https://education.github.com/git-cheat-sheet-education.pdf

