Git tutorial

Getting Started: Configure Git on a new machine

💡 You need to do these steps once on every computer on which you start using Git.

  • Make your identity known:

    git config --global user.name "John Doe"
    git config --global user.email johndoe@example.com
    
  • Set up SSH access to Git repositories
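
To confirm that the identity settings took effect, you can read them back with git config --get. The sketch below is a demonstration only: it points HOME at a throwaway directory (an assumption made so the example cannot touch your real ~/.gitconfig). On your own machine you would skip that line.

```shell
set -e
# Demo only: use a throwaway HOME so the real ~/.gitconfig stays untouched
export HOME="$(mktemp -d)"

git config --global user.name "John Doe"
git config --global user.email johndoe@example.com

# Read the values back to verify them
configured_name="$(git config --global --get user.name)"
configured_email="$(git config --global --get user.email)"
echo "Commits will be authored by: $configured_name <$configured_email>"
```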

Set up SSH access to Git repositories

Why use SSH keys in Git?

  • You need to authenticate for every Git action that communicates with the remote repository (e.g. push/pull).
  • The authentication method depends on whether you select an HTTPS or SSH remote URL when cloning the Git repository.
  • Using SSH keys means there is no need to provide the username and password for each action, for example, pushing or pulling changes or signing commits.

How to set up SSH access to Git repositories

1. Generate your SSH key pair

  • Run the following to start creating the key:

    ssh-keygen -t ed25519 -C "<comment>"
    

2. Locating your SSH Keys

  • You will be prompted to "Enter a file in which to save the key". You can specify a file location or press “Enter” to accept the default file location.

    > Enter a file in which to save the key (/home/username/.ssh/id_ed25519): [Press enter]
    

3. (Optional) Protecting your SSH keys

  • The next prompt will ask for a secure passphrase.
    💡 You can press Enter to leave the passphrase empty. Otherwise, you will be asked for the passphrase whenever the key is used (e.g. when pushing/pulling), unless the key is loaded into the ssh-agent (see step 4).

    > Enter passphrase (empty for no passphrase): [Type a passphrase]
    > Enter same passphrase again: [Type passphrase again]
    
  • This will create two files in the chosen directory:

    • private key: ~/.ssh/id_ed25519
    • public key: ~/.ssh/id_ed25519.pub
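
The prompts in steps 1 to 3 can also be answered directly on the command line. A minimal sketch, assuming OpenSSH's ssh-keygen is installed: -f sets the key file and -N sets the passphrase (empty here, purely for the demo). The key is written to a temporary directory, a hypothetical location chosen so that the example does not overwrite anything in ~/.ssh.

```shell
set -e
# Demo location; for real use you would keep the default ~/.ssh/id_ed25519
key_dir="$(mktemp -d)"

# -f: output file, -N: passphrase (empty for this demo), -q: no progress output
ssh-keygen -q -t ed25519 -C "demo key" -f "$key_dir/id_ed25519" -N ""

# Two files are created: the private key and the .pub public key
ls "$key_dir"

# The first field of the public key file is the key type
key_type="$(cut -d' ' -f1 "$key_dir/id_ed25519.pub")"
echo "Public key type: $key_type"
```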

4. Activating the SSH agent and adding keys to it

  • Add the new SSH key to the ssh-agent:
    💡 Before adding the new SSH key, make sure the ssh-agent is running by executing:

    eval "$(ssh-agent -s)"
    
  • Once the ssh-agent is running, the following command adds the new SSH key to it:

    ssh-add ~/.ssh/id_ed25519
    
  • Now, copy the public key by e.g. executing cat and copying the output:

    cat ~/.ssh/id_ed25519.pub
    

5. Adding the key to GitLab

  • GitLab: go to the left menu → Preferences → SSH Keys → Add new key
  • In the key field, paste the content of ~/.ssh/id_ed25519.pub
  • Press “Add key” at the bottom

Creating and cloning repositories

  • Create your first repository (e.g. in Gitlab)

  • Clone your repository to a local PC:

    • Copy the link to the repository by pressing "Code" → "Clone with SSH" → "Copy URL"
    • In the terminal:
      • Go to the location where you want to keep your project and execute
        git clone [URL_to_repo]
        
      • e.g.:
        git clone git@gitlab.com:mockusername/cnn_training.git
        
      💡 Note that Git commands act on the repository you are currently in, so change into the repository folder first: cd cnn_training.
  • You’ll see that the project's directory contains the same files as the remote Git repository (in this case, only README.md, which was created automatically):

    $ git clone git@gitlab.com:mockusername/cnn_training.git
    Cloning into 'cnn_training'...
    remote: Enumerating objects: 3, done.
    remote: Counting objects: 100% (3/3), done.
    remote: Compressing objects: 100% (2/2), done.
    remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
    Receiving objects: 100% (3/3), done.
    $ cd cnn_training/
    $ ls -l
    total 8
    -rw-rw-r-- 1 username username 6194 Aug 12 11:19 README.md
    
Git workflow in practice: first changes, commits & push

  • Now let’s make our first changes and commit them!

    • Let’s create a file train.py :

      def train():
          pass
      
      if __name__== "__main__":
          pass
      
  • Examine which files were changed/staged for commit/ignored

    • To do so, run git status:

      $ git status
      On branch main
      Your branch is up to date with 'origin/main'.
      
      Untracked files:
          (use "git add <file>..." to include in what will be committed)
          train.py
      
      nothing added to commit but untracked files present (use "git add" to track)
      
      
  • You can see the newly added file in the “Untracked files” list.

  • Let’s add the newly added file to the staging area:
    git add train.py
    
  • Now, if we run git status again, we’ll see that this file is added to the staging area.
    $ git status
    On branch main
    Your branch is up to date with 'origin/main'.
    
    Changes to be committed:
        (use "git restore --staged <file>..." to unstage)
        new file:   train.py
    
  • Let’s create a commit

    • To create a commit, run git commit -m "draft of train script".

      💡 One commit represents a set of changes (such as file modifications, additions, or deletions) you recorded in your repository. Each commit serves as a checkpoint that you can revert to or reference later, providing a detailed history of your project's development.
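
The status → add → commit cycle above can be replayed in a throwaway repository. This is a sketch under a few assumptions: the repository is created locally with git init instead of being cloned, the identity is set only for this repository, and git status --short is used to get a compact one-line-per-file view.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
# Local identity so the demo does not depend on your global configuration
git -C "$repo" config user.name "John Doe"
git -C "$repo" config user.email johndoe@example.com

printf 'def train():\n    pass\n' > "$repo/train.py"

before_add="$(git -C "$repo" status --short)"      # '??' marks an untracked file
git -C "$repo" add train.py
after_add="$(git -C "$repo" status --short)"       # 'A' marks a newly staged file
git -C "$repo" commit -q -m "draft of train script"
after_commit="$(git -C "$repo" status --short)"    # empty: nothing left to commit

echo "before add:   $before_add"
echo "after add:    $after_add"
echo "after commit: '$after_commit'"
git -C "$repo" log --oneline
```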

  • Push the commit

    • Finally, let’s push the created commit from the local to a remote repository:

      git push
      

      💡 git push transfers your commits from your local machine to a remote server (the remote Git repository), making them available to others who have access to that remote repository.

  • 💡 In other local repositories: pull the commit

    • If you have the repository cloned somewhere else (e.g. on a cloud machine), you need to download the changes from the remote Git server there. You can do this by running git pull:

      git pull
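
To see push and pull working together without a network connection, the round trip can be simulated locally. In this sketch a bare repository in a temporary directory stands in for the remote server, and the two clones are named laptop and cloud; all of these names are assumptions made for the demo.

```shell
set -e
work="$(mktemp -d)"
# A local bare repository stands in for the remote server (demo assumption)
git init -q --bare "$work/remote.git"

# First clone: the "laptop" (Git warns that the repository is empty)
git clone -q "$work/remote.git" "$work/laptop"
git -C "$work/laptop" config user.name "John Doe"
git -C "$work/laptop" config user.email johndoe@example.com
echo "# cnn_training" > "$work/laptop/README.md"
git -C "$work/laptop" add README.md
git -C "$work/laptop" commit -q -m "initial commit"
git -C "$work/laptop" push -q -u origin HEAD

# Second clone: the "cloud machine"
git clone -q "$work/remote.git" "$work/cloud"

# New commit on the laptop, pushed to the remote
printf 'def train():\n    pass\n' > "$work/laptop/train.py"
git -C "$work/laptop" add train.py
git -C "$work/laptop" commit -q -m "draft of train script"
git -C "$work/laptop" push -q

# The cloud clone receives the commit with git pull
git -C "$work/cloud" pull -q
ls "$work/cloud"
```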

Inspecting the changes

  • Scenario: add new files and modify already existing ones

    💡 Let’s say we want to add data preparation to our training (e.g. loading the data, splitting it into train/test etc).

    • Let's create a new script called dataset.py:

      def prepare_dataset():
          pass
      
    • Modify our train.py script as follows:

      from dataset import prepare_dataset
      def train():
          pass
      
      def main():
          prepare_dataset()
          train()
      
      if __name__== "__main__":
          main()
      
Overview of your project state with git status

  • When we inspect changes with git status, we only see a high-level overview:

    $ git status
    On branch main
    Your branch is ahead of 'origin/main' by 1 commit.
        (use "git push" to publish your local commits)
    
    Changes not staged for commit:
        (use "git add <file>..." to update what will be committed)
        (use "git restore <file>..." to discard changes in working directory)
        modified:   train.py
    
    Untracked files:
        (use "git add <file>..." to include in what will be committed)
        dataset.py
    
Inspecting modifications line-by-line: git diff

  • To go beyond git status, you can see exactly how your working directory differs from the staging area (or from the last commit, if nothing is staged) by running git diff

  • To show the differences for the specific file {filename}, run git diff {filename}:

    git diff train.py
    
  • To show the changes in all tracked files, run git diff:
    $ git diff
    diff --git a/train.py b/train.py
    index ac0b191..a71ce1a 100644
    --- a/train.py
    +++ b/train.py
    @@ -1,5 +1,11 @@
    +from dataset import prepare_dataset
    def train():
        pass
    
    +def main():
    +    prepare_dataset()
    +    train()
    +    pass
    +
    if __name__== "__main__":
    -    pass
    \ No newline at end of file
    +    main()
    \ No newline at end of file
    
git diff: staged vs unstaged changes

  • ⚠️ git diff with no other arguments shows only the changes that you have made but not yet staged:
    • e.g. if we now add our changes to the staging area and run git diff again, the output is empty:

      $ git add train.py dataset.py
      $ git diff
      
      # nothing is displayed
      
  • If instead you want to see the changes that you’ve already staged and that will go into your next commit, use git diff --staged (or its synonym git diff --cached). Note that git diff HEAD shows staged and unstaged changes together:

    $ git diff --staged
    diff --git a/dataset.py b/dataset.py
    new file mode 100644
    index 0000000..f2f21c0
    --- /dev/null
    +++ b/dataset.py
    @@ -0,0 +1,2 @@
    +def prepare_dataset():
    +    pass
    \ No newline at end of file
    diff --git a/train.py b/train.py
    index 7b7be23..9026856 100644
    --- a/train.py
    +++ b/train.py
    @@ -1,5 +1,10 @@
    +from dataset import prepare_dataset
    def train():
        pass
    
    +def main():
    +    prepare_dataset()
    +    train()
    +
    if __name__== "__main__":
    -    pass
    +    main()
    \ No newline at end of file
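
The difference between git diff and git diff --staged can be checked in a throwaway repository. A minimal sketch, assuming a locally created repository and a made-up edit to train.py; --name-only is used so that only the affected file names are printed instead of the full diff.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" config user.name "John Doe"
git -C "$repo" config user.email johndoe@example.com
printf 'def train():\n    pass\n' > "$repo/train.py"
git -C "$repo" add train.py
git -C "$repo" commit -q -m "draft of train script"

# Modify the tracked file, but do not stage it yet
printf 'def train():\n    return None\n' > "$repo/train.py"
unstaged_before="$(git -C "$repo" diff --name-only)"
staged_before="$(git -C "$repo" diff --staged --name-only)"

# Stage the change: it moves from the unstaged view to the staged view
git -C "$repo" add train.py
unstaged_after="$(git -C "$repo" diff --name-only)"
staged_after="$(git -C "$repo" diff --staged --name-only)"

echo "before add: unstaged='$unstaged_before' staged='$staged_before'"
echo "after add:  unstaged='$unstaged_after' staged='$staged_after'"
```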
    
Advanced staging techniques

💡 Scenario:

  • For the sake of this example, let’s move our 2 scripts to the src folder and add a bash script run_train.sh to the repository root with the following content:

    echo "Running the training loop"
    python src/train.py
    
Mastering git add: files, patterns, and directories

  • Adding specific files:
    • you can list multiple files separated by spaces in a single git add command:

      git add src/train.py src/dataset.py
      
    • alternatively, run git add multiple times:

      git add src/train.py
      git add src/dataset.py
      
  • Staging all changed files

    • You can stage all modified and new files by using the . (dot) syntax.

    • This command stages all changes in the current directory and its subdirectories.

      $ git add .
      $ git status
      On branch main
      Your branch is ahead of 'origin/main' by 1 commit.
          (use "git push" to publish your local commits)
      
      Changes to be committed:
          (use "git restore --staged <file>..." to unstage)
          new file:   run_train.sh
          new file:   src/dataset.py
          new file:   src/train.py
          deleted:    train.py
      
Stage updates to tracked files only:

  • Use git add -u to stage only updates to files that are already tracked:
    • ✅ Stages modified and deleted files
    • ❌ Does not stage new untracked files
  • This is helpful when you want to commit your edits, but still exclude new files you're not ready to commit yet:

    $ git status
    On branch main
    Your branch is up to date with 'origin/main'.

    Changes not staged for commit:
        (use "git add <file>..." to update what will be committed)
        (use "git restore <file>..." to discard changes in working directory)
        modified:   train.py

    Untracked files:
        (use "git add <file>..." to include in what will be committed)
        dataset.py

    no changes added to commit (use "git add" and/or "git commit -a")
    $ git add -u
    $ git status
    On branch main
    Your branch is up to date with 'origin/main'.

    Changes to be committed:
        (use "git restore --staged <file>..." to unstage)
        modified:   train.py

    Untracked files:
        (use "git add <file>..." to include in what will be committed)
        dataset.py

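
The behaviour of git add -u can be reproduced in a few lines. This sketch assumes a throwaway local repository with one committed file; the edit to train.py is made up for the demo.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" config user.name "John Doe"
git -C "$repo" config user.email johndoe@example.com
printf 'def train():\n    pass\n' > "$repo/train.py"
git -C "$repo" add train.py
git -C "$repo" commit -q -m "draft of train script"

# One edit to a tracked file, one brand-new untracked file
printf 'def train():\n    return None\n' > "$repo/train.py"
printf 'def prepare_dataset():\n    pass\n' > "$repo/dataset.py"

# -u stages the tracked file only; dataset.py stays untracked ('??')
git -C "$repo" add -u
status_after="$(git -C "$repo" status --short)"
echo "$status_after"
```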
  • Staging files by pattern

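    • You can use wildcard patterns to stage multiple files that match a specific pattern, e.g. git add *.py

    • If the current directory contains .py files, the shell expands *.py to those files only. If it contains none (as here, after the move to src/, when using bash) or if you quote the pattern as '*.py', Git receives the pattern itself and matches it in all subdirectories: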

      $ git add *.py
      $ git status
      On branch main
      Your branch is ahead of 'origin/main' by 1 commit.
          (use "git push" to publish your local commits)
      
      Changes to be committed:
          (use "git restore --staged <file>..." to unstage)
          new file:   src/dataset.py
          new file:   src/train.py
          deleted:    train.py
      
      Untracked files:
          (use "git add <file>..." to include in what will be committed)
          run_train.sh
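
Quoting the pattern makes the behaviour independent of what the shell finds in the current directory. A minimal sketch, assuming a throwaway repository with the same layout as in the scenario (two scripts in src/ and run_train.sh in the root):

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
mkdir "$repo/src"
printf 'def train():\n    pass\n' > "$repo/src/train.py"
printf 'def prepare_dataset():\n    pass\n' > "$repo/src/dataset.py"
printf 'python src/train.py\n' > "$repo/run_train.sh"

# Quoted: the shell passes '*.py' through unchanged and Git matches it
# against paths in every subdirectory
git -C "$repo" add '*.py'
staged="$(git -C "$repo" diff --staged --name-only)"
echo "$staged"
```

Only the two Python files are staged; run_train.sh does not match the pattern and stays untracked.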
      
      
  • Staging files in a directory

    • You can stage all files within a specific directory with git add path/to/directory:

      $ git add src/
      $ git status
      On branch main
      Your branch is ahead of 'origin/main' by 1 commit.
          (use "git push" to publish your local commits)
      
      Changes to be committed:
          (use "git restore --staged <file>..." to unstage)
          new file:   src/dataset.py
          new file:   src/train.py
      
      Changes not staged for commit:
          (use "git add/rm <file>..." to update what will be committed)
          (use "git restore <file>..." to discard changes in working directory)
          deleted:    train.py
      
      Untracked files:
          (use "git add <file>..." to include in what will be committed)
          run_train.sh
      
      
Moving and removing files: git mv and git rm

  • Removing files: git rm

    • Use git rm to delete a file from your working directory and stage the deletion:
      • You can remove a file using:
        git rm wrong_file.py
        
      • This would be equivalent to:
        rm wrong_file.py
        git add wrong_file.py
        
  • Moving or renaming files

    • Git doesn't track file renames explicitly — it infers them from content similarity.
    • Still, Git offers convenience commands for moving and renaming files.
    • You can rename or move a file using git mv:
      git mv old_name.py new_name.py
      
    • This is equivalent to:
      mv old_name.py new_name.py
      git add old_name.py new_name.py
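
The claimed equivalence can be checked by performing the rename both ways and comparing what git status reports. This is a sketch assuming two identical throwaway repositories; make_repo is a hypothetical helper defined only for this demo.

```shell
set -e
make_repo() {
    # Create a throwaway repository with one committed file and print its path
    repo="$(mktemp -d)"
    git init -q "$repo"
    git -C "$repo" config user.name "John Doe"
    git -C "$repo" config user.email johndoe@example.com
    printf 'def train():\n    pass\n' > "$repo/old_name.py"
    git -C "$repo" add old_name.py
    git -C "$repo" commit -q -m "add old_name.py"
    echo "$repo"
}

# Variant 1: git mv
repo_a="$(make_repo)"
git -C "$repo_a" mv old_name.py new_name.py
status_mv="$(git -C "$repo_a" status --short)"

# Variant 2: plain mv, then stage both the old and the new path
repo_b="$(make_repo)"
mv "$repo_b/old_name.py" "$repo_b/new_name.py"
git -C "$repo_b" add old_name.py new_name.py
status_manual="$(git -C "$repo_b" status --short)"

echo "git mv:       $status_mv"
echo "mv + git add: $status_manual"
```

Both variants leave the staging area in the same state, which is why Git has no need to record renames explicitly.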
      
Avoid adding unnecessary files: .gitignore

  • You’ve seen that you can add all files that are located in a folder or that follow a specific pattern.
  • That can simplify the process, but you need to be careful not to add unnecessary files.
Avoid committing unnecessary files

  • To keep a Git repository clean, you need to avoid committing unnecessary files

  • Key reasons to avoid committing unnecessary files:

    • Limited repository size on GitHub and GitLab:
      • Files up to 100 MB, repositories up to 10 GB
    • The smaller the repository, the faster cloning, pulling, and pushing are
    • Enhanced collaboration: Ensures only relevant files are shared, and prevents merge conflicts.
    • Security: Protects sensitive information.
Examples of unnecessary files:

  • IDE configurations (e.g., .idea, .vscode)
  • automatically created folders: __pycache__, .ipynb_checkpoints
  • sensitive configuration files (API keys, passwords)
  • log files from slurm
  • your virtual environment (e.g. .venv)
  • …and much more
How to avoid adding unnecessary files?
  • Remove files from the staging area:

    • To remove all files added to the staging area, run
      git restore --staged .
      
  • Create & use a .gitignore file!

    • The .gitignore file is a special file in a Git repository that tells Git which files or directories to ignore.
    • Git will not list ignored files as untracked, git add will skip them, and they won't end up in commits. Files that are already tracked are not affected.
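
Unstaging with git restore --staged only empties the staging area; the files themselves stay on disk. A minimal sketch, assuming a throwaway repository with one existing commit and a hypothetical log file train.log that was staged by accident:

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" config user.name "John Doe"
git -C "$repo" config user.email johndoe@example.com
echo "# cnn_training" > "$repo/README.md"
git -C "$repo" add README.md
git -C "$repo" commit -q -m "initial commit"

# A log file (hypothetical name) gets staged by an overly broad 'git add .'
echo "debug output" > "$repo/train.log"
git -C "$repo" add .
staged_before="$(git -C "$repo" diff --staged --name-only)"

# Unstage everything; the file is kept in the working directory
git -C "$repo" restore --staged .
staged_after="$(git -C "$repo" diff --staged --name-only)"

echo "staged before: '$staged_before'"
echo "staged after:  '$staged_after'"
ls "$repo"
```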
Common patterns for .gitignore
  • Ignore specific files:

    • This ignores a specific run_train.sh file.

      run_train.sh
      
  • Ignore directories:

    • This ignores the data/ directory and all its contents.

      data/
      
  • Ignore files by extension / pattern:

    • This pattern ignores all files with the .txt extension no matter where they are in the directory structure.

      *.txt
      
Let’s see .gitignore in action!
  • Create a folder data and create 2 files there: train.txt and text.txt.

  • Additionally, create config.yml in the root directory.

  • Create a .gitignore file with the following content:

    data/*
    *.yml
    
  • Now, add the whole repository root to the staging area and see what happens:

    $ ls -l
    total 20
    -rw-rw-r-- 1 username username    0 Aug 12 13:37 config.yml
    drwxrwxr-x 2 username username 4096 Aug 12 13:37 data
    -rw-rw-r-- 1 username username 6194 Aug 12 11:19 README.md
    -rw-rw-r-- 1 username username   52 Aug 12 12:56 run_train.sh
    drwxrwxr-x 2 username username 4096 Aug 12 12:14 src
    $ git add .
    $ git status
    On branch main
    Your branch is ahead of 'origin/main' by 1 commit.
        (use "git push" to publish your local commits)
    
    Changes to be committed:
        (use "git restore --staged <file>..." to unstage)
        new file:   .gitignore
        new file:   run_train.sh
        new file:   src/dataset.py
        new file:   src/train.py
        deleted:    train.py
    
  • Even though config.yml and the data folder exist locally, they were ignored and not added to the staging area.
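
If you are unsure why a file is or is not being ignored, git check-ignore can tell you which pattern matched. A minimal sketch, assuming a throwaway repository that recreates part of the layout above with empty files:

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
mkdir "$repo/data"
touch "$repo/data/train.txt" "$repo/config.yml" "$repo/run_train.sh"
printf 'data/*\n*.yml\n' > "$repo/.gitignore"

# -v prints the .gitignore file, line number, and pattern that matched
git -C "$repo" check-ignore -v config.yml data/train.txt

# The exit status is 0 for ignored paths and 1 otherwise
if git -C "$repo" check-ignore -q run_train.sh; then
    run_train_ignored=yes
else
    run_train_ignored=no
fi
echo "run_train.sh ignored: $run_train_ignored"
```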
Let's prepare for the exercise!
  1. (If not done already) Configure Git on your machine as discussed at the beginning

  2. Head to the repository of this workshop: https://gitlab.mpcdf.mpg.de/mpcdf/training/software-engineering-in-python

  3. Clone the repository locally:

    git clone git@gitlab.mpcdf.mpg.de:mpcdf/training/software-engineering-in-python.git
    
Key takeaways

  • Commit early, commit often: small, frequent commits help track changes and make it easier to spot issues.
  • Push regularly: push your code often to keep backups.
  • Better done than perfect: it’s okay to push code that’s not perfect, progress is what matters.
  • Git is your safety net: you can always revert changes if something goes wrong.
  • Write clear commit messages: good messages make it easier for everyone to understand the changes.
  • Use .gitignore to avoid adding unnecessary files
  • Learning takes time:
    • begin by using git for small projects in a linear fashion to get familiar with the basics.
    • get comfortable with git through practice and experimentation with more complicated things like branching, merging, etc.