There are myriad reasons you might want to do this. Maybe someone thought it was a good idea to commit binaries and now your repo is gigabytes in size. Maybe someone, accidentally or otherwise, committed one or more passwords, and even though you've since removed them, they are still retrievable through the commit history.
Either way it's now up to you to do the one thing git was designed not to do: Forget.
You will be making changes to the history of the repo. This will cause problems for others working on the repo in parallel to you.
You're doing some dangerous and potentially permanent damage to your repo!
Make as many damn backups as feels safe and get them off your machine!
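As a minimal sketch of what a backup could look like, the function below makes a mirror clone plus a single-file bundle of the repo. The name backup_repo and the date-stamped file names are just illustrative choices, not part of any tool; copying the results off your machine is still up to you.

```shell
#!/bin/bash
# Sketch only: back up a repo as a mirror clone plus a bundle file.
# backup_repo and the naming scheme are hypothetical, pick your own.
backup_repo() {
    local repo_dir=$1 backup_dir=$2
    local name stamp
    name=$(basename "$repo_dir")
    stamp=$(date +%Y-%m-%d)
    mkdir -p "$backup_dir" || return 1
    # Resolve to an absolute path so "git -C" below does not change its meaning.
    backup_dir=$(cd "$backup_dir" && pwd) || return 1
    # A mirror clone copies every ref (branches, tags, notes), not just one branch.
    git clone --quiet --mirror "$repo_dir" "$backup_dir/$name-$stamp.git" || return 1
    # A bundle packs the full history into one file that is easy to ship elsewhere.
    git -C "$repo_dir" bundle create "$backup_dir/$name-$stamp.bundle" --all || return 1
    # Check that the bundle is readable before trusting it.
    git -C "$repo_dir" bundle verify "$backup_dir/$name-$stamp.bundle"
}
```

For example, backup_repo /opt/my_c00l_repo /tmp/backups would leave a .git mirror directory and a .bundle file in /tmp/backups, ready to be copied somewhere remote.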
First, you're going to need to install a git filtering tool. There have been many over the years, but as of 2021 the Git project's own filter-branch documentation recommends filter-repo instead.
filter-repo gives you a way to clean the history of your repo. Although there are other tools that do the same thing, filter-repo is much faster and more stable.
To install on Ubuntu:
sudo apt update
sudo apt install snapd
snap install git-filter-repo --edge
git filter-repo
Run filter-repo against the directories or files you want to remove. On its own, --path selects what to keep, so pair it with --invert-paths to remove the named paths instead. You can run it repeatedly against one directory or file at a time, or you can cover multiple directories in one run with multiple --path flags. You can also add the --dry-run flag to see what would change without rewriting the repo.
Ex.
# Single directory
git filter-repo --path path/within/repo/ --invert-paths
# Test run against single file
git filter-repo --path path/within/repo/myscript.java --invert-paths --dry-run
# Specific file & directory
git filter-repo --path path/within/repo/myscript.java --path path/to/another/dir/ --invert-paths
This is the danger zone. Check your commands twice before you run them.
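One way to double-check a run like the ones above is to ask git whether any commit on any ref still touches the path. This is only a sketch using plain git log; the function name path_in_history is made up for illustration.

```shell
#!/bin/bash
# Sketch only: succeed if any commit reachable from any ref touched the path.
# path_in_history is a hypothetical helper name.
path_in_history() {
    local repo_dir=$1 path=$2 hit
    # --all walks every ref; "--" separates revisions from paths, so this
    # also works for paths that no longer exist in the working tree.
    hit=$(git -C "$repo_dir" log --all -n 1 --format=%H -- "$path") || return 2
    [ -n "$hit" ]
}

# Usage:
#   if path_in_history /opt/my_c00l_repo path/within/repo/myscript.java; then
#       echo "still in history"
#   fi
```

Before filtering, this should succeed even for files that were deleted in a later commit, which is exactly the problem. After a successful filter-repo run it should find nothing for the removed paths.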
You probably know most of the large files in your repo, but the following script can help you sort all blobs by size.
Paste the following into a .sh file and make it executable.
#!/bin/bash -e
function main {
    # declare and assign separately so a mktemp failure is not masked by "local"
    local tempFile
    tempFile=$(mktemp)
    # walk every commit and append all files in its tree to $tempFile
    local IFS=$'\n'
    local commitSHA1
    for commitSHA1 in $(git rev-list --all); do
        git ls-tree -r --long "$commitSHA1" >>"$tempFile"
    done
    # sort files by SHA1, de-dupe list and finally re-sort by filesize
    sort --key 3 "$tempFile" | \
        uniq | \
        sort --key 4 --numeric-sort --reverse
    # remove temp file
    rm "$tempFile"
}
main
Now cd into the base directory of your repo and run the shell script, redirecting its output into a text file. It may take a few minutes to run, but you should have a list of blobs sorted largest first once it's done.
cd /opt/my_c00l_repo
~/example.sh > /tmp/lgfiles.txt
Now run filter-repo against the directories or files in question.
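If the list is long, typing out a --path flag per file gets old fast. The sketch below pulls the unique paths of the N largest blobs out of the report so they can be reviewed and then handed to filter-repo's --paths-from-file option. It assumes the report is the raw output of the script above (git ls-tree --long format, with a tab before the path); top_paths and /tmp/paths.txt are illustrative names.

```shell
#!/bin/bash
# Sketch only: print the unique paths of the COUNT largest blobs in REPORT.
# Assumes REPORT lines look like "<mode> <type> <sha> <size><TAB><path>",
# sorted largest first. top_paths is a hypothetical helper name.
top_paths() {
    local report=$1 count=${2:-20}
    # The path is everything after the first tab; the same path can show up
    # several times (one line per blob version), so de-dupe it.
    head -n "$count" "$report" | cut -f 2- | sort -u
}

# Usage:
#   top_paths /tmp/lgfiles.txt 50 > /tmp/paths.txt
#   # review /tmp/paths.txt by hand, then:
#   git filter-repo --invert-paths --paths-from-file /tmp/paths.txt
```

Read through the generated list before using it. git ls-tree quotes unusual file names, and the biggest blobs are not always the ones you actually want gone.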
This is a little more in depth. Most of the time when secrets are hard coded into the repo, development work is needed to get them out.
First you're going to need some tools for finding these hidden secrets. Suggested ones are plain grep, gitleaks, and trufflehog:
grep -rnw '/path/to/somewhere/' -e 'pass'
Use these tools to compile a list of files that need to be changed. Check with developers whether you can simply swap out the secrets for variables, or whether more complicated work is necessary.
# Scan each repo with gitleaks and trufflehog.
# CLI flags differ between tool versions; check --help for the ones you have.
REPOS=("myrepo1" "yourrepo1" "myrepo2")
for i in "${REPOS[@]}"; do
    REPO="/path/to/repos/${i}"
    echo "Running gitleaks against ${i}"
    time gitleaks --path="${REPO}" -v > "${REPO}/gitleaks_scan_$(date +"%Y-%m-%d").json"
    echo ""
    echo "Running trufflehog against ${i}"
    time trufflehog "file://${REPO}" > "${REPO}/trufflehog_$(date +"%Y-%m-%d").scan"
done
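A plain grep like the one above only sees files as they are on disk right now. To catch a secret that was removed in a later commit, git's own pickaxe search (git log -S) can list the commits that added or removed a given string. A rough sketch follows; the function name and the hunter2 example are made up.

```shell
#!/bin/bash
# Sketch only: list commits, on any ref, that changed how many times NEEDLE
# occurs, which means commits that added or removed it.
# commits_adding_or_removing is a hypothetical helper name.
commits_adding_or_removing() {
    local repo_dir=$1 needle=$2
    git -C "$repo_dir" log --all --oneline -S"$needle"
}

# Usage:
#   commits_adding_or_removing /path/to/repos/myrepo1 'hunter2'
```

Add -p to the git log call if you also want to see the diffs, which helps when building the list of files that need fixing.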
Before you begin, have all the modified files ready outside your repo. You'll basically be deleting the files to get rid of their history. Again, have all the developers synced, a code freeze in place, and a full backup stored remotely somewhere.
Run filter-repo, this time specifying the exact file paths.
Add the corrected files back and commit them locally.
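The shuffle of parking the corrected files outside the repo and bringing them back afterwards could look roughly like this. copy_files, /tmp/fixed and conf/app.properties are all hypothetical names used for illustration; the filter-repo and commit steps are shown as comments so nothing destructive runs by accident.

```shell
#!/bin/bash
# Sketch only: copy files from SRC to DST, keeping their relative paths.
# copy_files is a hypothetical helper name.
copy_files() {
    local src=$1 dst=$2 f
    shift 2
    for f in "$@"; do
        mkdir -p "$dst/$(dirname "$f")" || return 1
        cp "$src/$f" "$dst/$f" || return 1
    done
}

# Usage (paths are examples):
#   copy_files /opt/my_c00l_repo /tmp/fixed conf/app.properties   # park the cleaned file
#   git filter-repo --path conf/app.properties --invert-paths     # drop it from history
#   copy_files /tmp/fixed /opt/my_c00l_repo conf/app.properties   # bring it back
#   git add conf/app.properties
#   git commit -m "Re-add config without hard-coded secrets"
```

Make sure the parked copies really are the cleaned versions. Re-adding a file that still contains the secret puts you right back where you started.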
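Now that you've got your repo squeaky clean locally, you need to push the entire rewrite of the repo history to your git provider.
Start by removing and re-adding the remote git location. filter-repo removes the origin remote itself after a rewrite as a safety measure, so if the remove command complains that there is no such remote, that is harmless and you can go straight to adding it back.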
git remote remove origin
git remote add origin https://github.com/acme/your_repo.git
Now force push the altered history to your remote git service.
git push origin --force 'refs/heads/*'
git push origin --force 'refs/tags/*'
git push origin --force 'refs/replace/*'
OR
This recommendation from GitHub:
git push origin --force --all
git push origin --force --tags
Reach out to the affected developers to let them know the changes are complete. Rather than hard-resetting their existing clones, it is suggested that they archive the ones they have and clone a fresh copy.
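On the developer side, the archive-and-reclone step might be as simple as the sketch below. The function name reclone and the .pre-rewrite suffix are just illustrative; any unpushed work stays in the archived copy and has to be carried over by hand.

```shell
#!/bin/bash
# Sketch only: move a stale clone aside and clone fresh in its place.
# reclone and the archive naming are hypothetical.
reclone() {
    local old_clone=${1%/} url=$2
    local archive
    archive="$old_clone.pre-rewrite-$(date +%Y-%m-%d)"
    # Keep the old clone around instead of deleting it, in case of unpushed work.
    mv "$old_clone" "$archive" || return 1
    git clone --quiet "$url" "$old_clone" || return 1
    echo "Old clone kept at $archive"
}

# Usage:
#   reclone ~/src/your_repo https://github.com/acme/your_repo.git
```

Developers should not push from the archived clone afterwards. Doing so would reintroduce the old history you just removed.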
Good luck!