Stubbisms – Tony’s Weblog

July 10, 2009

Git Script to Show Largest Pack Objects and Trim Your Waist Line!

Filed under: Java, by Antony Stubbs @ 2:07 pm

This is a script I put together after migrating the Spring Modules project from CVS, using git-cvsimport (which I also had to patch, to get to work on OS X / MacPorts). I wrote it because I wanted to get rid of all the large jar files, and documentation etc, that had been put into source control. However, if _large files_ are deleted in the latest revision, then they can be hard to track down.

The script effectively sidesteps this limitation: it simply goes through the list of all objects in your pack file (so run git gc first, so that all your objects are in your pack) and lists the largest ones, showing you their information. Then, with the file locations, you can run:

# remove a tree from entire repo history
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# pull in a repo without the junk
git pull file://$(pwd)/myGitRepo

Which will remove them from your entire history, trimming your waist line nicely! But be sure to follow the advice from the man page for filter-branch: there are things you should be aware of, such as old tags (that one got me). Rather than messing around trying to get it exactly right, I actually just retagged the new repo by matching the dates of the tags from the initial cvsimport (there were only 9, after all)!

But for reference, here is the command I’m referring to, from the git-filter-branch man page:

You really filtered all refs: use --tag-name-filter cat -- --all when calling git-filter-branch.
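
As a rough sketch of how that advice combines with the index filter from earlier: the helper below removes the given paths from every branch and tag. The name purge_paths is made up for illustration and is not a git command. It assumes bash, a clean working tree, and that you are running it on a throwaway copy of the repository. The FILTER_BRANCH_SQUELCH_WARNING variable only matters on newer git versions.

```shell
#!/bin/bash
# Sketch only: purge_paths is a hypothetical helper name, not a git command.
# Removes the given paths from every commit on every ref, rewriting tags too.
# Run it from the top level of a working tree, on a throwaway copy of the repo.
purge_paths() {
    if [ "$#" -eq 0 ]; then
        echo "usage: purge_paths <path>..." >&2
        return 1
    fi
    local quoted
    # Shell-quote each path so names containing spaces survive the eval
    # that filter-branch performs on the index filter.
    printf -v quoted '%q ' "$@"
    # Newer git versions print a long warning and pause before running
    # filter-branch; this variable suppresses that (older versions ignore it).
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
        --index-filter "git rm -rf --cached --ignore-unmatch -- $quoted" \
        --tag-name-filter cat -- --all
}
```

Usage would look like: purge_paths lib/big.jar docs/manual.pdf. The old commits stay reachable through the refs/original/ backups that filter-branch writes, so the repository does not shrink yet at this point.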

There are a few different suggestions as to how to remove the loose objects from your repository, in order to _really_ make it shrink straight away, my favourite being from the man page:

git-filter-branch is often used to get rid of a subset of files, usually with some combination
of --index-filter and --subdirectory-filter. People expect the resulting repository to be
smaller than the original, but you need a few more steps to actually make it smaller, because
git tries hard not to lose your objects until you tell it to. First make sure that:

o You really removed all variants of a filename, if a blob was moved over its lifetime. git
log --name-only --follow --all -- filename can help you find renames.

o You really filtered all refs: use --tag-name-filter cat -- --all when calling
git-filter-branch.

Then there are two ways to get a smaller repository. A safer way is to clone, that keeps your
original intact.

o Clone it with git clone file:///path/to/repo. The clone will not have the removed objects.
See git-clone(1). (Note that cloning with a plain path just hardlinks everything!)
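
The quoted passage stops at the safer route. The same man page section also describes an in-place route: delete the refs/original/ backups, expire the reflogs, then garbage collect. A minimal sketch of those steps follows, wrapped in a hypothetical shrink_in_place function. It is destructive, so try it on a copy first.

```shell
#!/bin/bash
# Sketch of the in-place route from the same man page section.
# shrink_in_place is a hypothetical helper name. It is destructive: it drops
# the refs/original/ backups that filter-branch leaves behind.
shrink_in_place() {
    # Delete the backup refs created by git filter-branch.
    git for-each-ref --format='%(refname)' refs/original/ |
        while IFS= read -r ref; do
            git update-ref -d "$ref"
        done
    # Expire every reflog entry so the old commits are no longer referenced.
    git reflog expire --expire=now --all
    # Repack and drop all objects that nothing refers to any more.
    git gc --prune=now
}
```

Run it from inside the rewritten repository, and only once you are sure the filtered history is what you want, because the backups are gone afterwards.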

Apart from the section on “are your objects _really_ loose?”, the most useful bit of information was running the git-pull command, which someone suggested in the discussion on the git mailing list. This was the only thing that actually worked for me, contrary to what it states about git-clone. However, be careful, as git pull by default doesn’t pull over all information…
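
If you want the branches and tags to come along as well, one option, sketched here as an assumption rather than the method used above, is to fetch explicit refspecs over a file:// URL into a fresh bare repository. The function name slim_copy and both of its arguments are hypothetical.

```shell
#!/bin/bash
# Sketch: copy every branch and tag, but only reachable objects, into a fresh
# bare repository. slim_copy is a hypothetical helper name.
#   $1 = existing (already filtered) repository, $2 = new bare repository
slim_copy() {
    local src dest="$2"
    # file:// needs an absolute path; it also forces the normal pack transfer
    # instead of hardlinking the old object store.
    src="$(cd "$1" && pwd)" || return 1
    git init --quiet --bare "$dest" || return 1
    # Fetch all branches and all tags into the same ref names in the copy.
    git --git-dir="$dest" fetch --quiet "file://$src" \
        '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
}
```

Usage would look like: slim_copy myGitRepo myGitRepo-slim.git. HEAD in the new repository keeps git's default branch name, so it may need pointing at one of the fetched branches with git symbolic-ref before you clone from it.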

And without further ado, here is the script:

#!/bin/bash
#set -x 

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`

echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
for y in $objects
do
	# extract the size in bytes and convert it to kB
	# (awk splits on runs of blanks, so the padding after the object type does not shift the field)
	size=$((`echo $y | awk '{print $3}'`/1024))
	# extract the compressed size in bytes and convert it to kB
	compressedSize=$((`echo $y | awk '{print $4}'`/1024))
	# extract the SHA
	sha=`echo $y | cut -f 1 -d ' '`
	# find the object's location in the repository tree
	other=`git rev-list --all --objects | grep $sha`
	#lineBreak=`echo -e "\n"`
	output="${output}\n${size},${compressedSize},${other}"
done

echo -e $output | column -t -s ', '
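
For comparison, here is a sketch of the same ranking written as a pair of small functions that parse each line with awk and force the C locale for sorting. The names top_pack_objects and annotate_with_paths are hypothetical. The second one looks up paths in a single saved git rev-list listing instead of re-running that command for every object. Treat it as an illustration under those assumptions, not a drop-in replacement.

```shell
#!/bin/bash
# Sketch: rank `git verify-pack -v` output by object size.
# Both function names are hypothetical helpers, not part of git.

# Reads verify-pack lines on stdin; prints "sizeKB packedKB sha" for the
# N largest objects (N defaults to 10).
top_pack_objects() {
    local count="${1:-10}"
    # Object lines look like: <sha> <type> <size> <size-in-pack> <offset> [...]
    # Summary lines (chain length, non delta, "pack: ok") do not match the type test.
    LC_ALL=C awk '$2 ~ /^(blob|tree|commit|tag)$/ { print $3, $4, $1 }' \
        | LC_ALL=C sort -k1,1nr \
        | head -n "$count" \
        | awk '{ printf "%d %d %s\n", $1 / 1024, $2 / 1024, $3 }'
}

# Reads "sizeKB packedKB sha" lines on stdin and appends the path recorded
# for each SHA in a file holding `git rev-list --all --objects` output ($1).
# Prints "sizeKB,packedKB,sha,path"; the path is empty if the SHA is not listed.
annotate_with_paths() {
    local names_file="$1"
    awk 'FILENAME == ARGV[1] {
             sha = $1
             sub(/^[^ ]+ ?/, "")      # drop the SHA, keep the rest as the path
             path[sha] = $0
             next
         }
         { print $1 "," $2 "," $3 "," path[$3] }' "$names_file" -
}
```

Usage would look like: git rev-list --all --objects > /tmp/names.txt, followed by git verify-pack -v "$(git rev-parse --git-dir)"/objects/pack/pack-*.idx | top_pack_objects 10 | annotate_with_paths /tmp/names.txt. The output is comma-separated so that paths containing spaces stay in one piece.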

Thanks to David Underhill for the inspiration, and the various posts on the git mailing list!

For other migration tips (svn) – see here: http://fpereda.wordpress.com/2008/06/11/how-i-migrated-paludis-to-git/

P.s. if someone tries running the script on Linux or Cygwin and it needs modifying, let me know and I’ll post the modified versions all next to each other in this article.

31 Comments »

  1. This script works without modification on OpenSuse 11.4. I have to thank you! I thought I cleaned up my repository properly but the size was still > 300 MB, finally, with your script I found a long forgotten/deleted directory still packed in there 😉

    Comment by rayburgemeestre — April 16, 2011 @ 8:37 am

  2. here’s a script that is looking for file renames in place, and is much faster: http://pastie.org/3894536
    and here’s the script to actually delete those file from the repo: http://pastie.org/3894555

    Comment by Dávid Debreczeni — May 12, 2012 @ 12:44 am

  3. @David Nice!

    Comment by Antony Stubbs — May 12, 2012 @ 6:17 am

  4. in Git for Windows environment, we don’t have the column command, so I removed it from your script. It worked just fine – except the format was of course not column aligned. But still perfectly useful, and I thank you!

    Comment by northben — June 3, 2013 @ 3:12 pm

  5. […] found this script that gives you a nice list of files ordered by size: https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-wais… ( I haven’t figured out yet how to remove from history the same file with different […]

    Pingback by Cleaning a git repository | Climb the problem — August 23, 2013 @ 11:25 am

  6. Hi, I have used your guide but i am interested to know how you got over this particular issue : “However, be careful, as git pull by default doesn’t pull over all information…” ? I mean if i do clone i can see the size is not shrinking and if i do pull it shrinks BUT i am missing all the branches and tags and maybe something else i have not been aware of. How can i pull into new repo but get ALL information before i push the slimed down repository to my server ?

    Comment by Mihai Marinescu — February 4, 2014 @ 9:53 pm

  7. I’ve modified the following line:

    objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`

    by prefixing the path to pack files with the $GIT_DIR:

    GIT_DIR=`git rev-parse --git-dir` || exit $?
    objects=`git verify-pack -v $GIT_DIR/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`

    Now the script works for bare repos too and from any subdir (including the .git itself).

    Comment by Andrew Nefedkin — February 25, 2014 @ 10:52 pm

  8. […] step by step: – use the a biggit_file_finder.sh to get the biggest files. – check any dead ducks to kill(big uneeded files, or whole dirs with […]

    Pingback by shrink git repo to a small size | yanoobblog — May 20, 2014 @ 2:45 am

  9. […] It is a very powerful tool in your git arsenal. There are already helper scripts available to identify big objects in your git repository, so that should be easy […]

    Pingback by How To Handle Big Repositories With Git - SkyOffice Consulting | SkyOffice Consulting — May 23, 2014 @ 10:04 am

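  10. […] For huge repositories that contain large binary baggage committed by mistake, or old assets that will never be used again, filter-branch is a very useful solution. With this command you can walk through the entire history of a project and extract, modify, change, or exclude files according to a predefined pattern. It is a very powerful tool for projects that use Git. Helper scripts for identifying large objects in a repository are already available and are easy to use. […]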

    Pingback by 巨大なリポジトリを Git で上手く扱う方法 | Atlassian Japan — May 27, 2014 @ 11:24 am

  11. […] Script (will show top 10 biggest files): […]

    Pingback by Yeri » Blog Archive » Deleting big files from your git history — June 9, 2014 @ 5:09 am

  12. That loop is really inefficient because it’s calling `git rev-list --all --objects` every time through. That command spits out a list of every object in the repository. In my 7-year repo, that list is enormous and takes a long time to generate. Fortunately, the inefficiency is easily fixed by creating a new variable above the for:

    allObjects=`git rev-list --all --objects`

    And then re-writing the other= line like so:

    other=`echo "${allObjects}" | grep $sha`

    Comment by nkocharh — June 17, 2014 @ 11:06 pm

  13. There are several other ways to speed up this script. They are incorporated into the rewritten version here: https://gist.github.com/nk9/b150542ef72abc7974cb

    Comment by nkocharh — June 24, 2014 @ 8:50 am

  14. […] It is a very powerful tool in your git arsenal. There are already helper scripts available to identify big objects in your git repository, so that should be easy […]

    Pingback by How to handle big repositories with git | Pix software beneluxPix software benelux — July 2, 2014 @ 12:53 pm

  15. Hi, thanks for your script. I used it to find large files in my git repository. The two largest objects it shows have no SHA and location. What does that mean?

    All sizes are in kB’s. The pack column is the size of the object, compressed, inside the pack file.
    size pack SHA location
    1433301 262013
    440978 80296
    28 4 9cb30851652bd4a4bfd11244f1cc73ae8db64631 python/probability/.fractionActiveQueries.py.swp

    Comment by Sumeet K — July 14, 2015 @ 12:36 pm

  16. hey – i really need to do this but not sure where this scripts should be saved, what name or how it should be used. i.e. in the repo etc.

    Any clarification would be much appreciated!

    Comment by dpcwp — October 25, 2015 @ 12:31 am

  17. Warning: this script fails in many locales, including those using space for [Digit grouping](https://en.wikipedia.org/wiki/Decimal_mark#Digit_grouping)
    Symptom: irrelevant objects are ranked at the top.

    Long story short, adding the line below near the beginning of the script is enough to solve this issue.

    export LC_ALL=C

    Example before fix:

    All sizes are in kB’s. The pack column is the size of the object, compressed, inside the pack file.
    size pack SHA location
    0 0 4dd7dae4f122af7764048d5e3d736253f93dd4f4
    1 0 e17cf593175a915108f1c527e742539642414848 Net.DDP.Client/JsonDeserializeHelper.cs
    1183 1178 fb831edc88ab20ee23e4e5cae28245bd183e32e3 packages/Newtonsoft.Json.4.5.9/Newtonsoft.Json.4.5.9.nupkg
    596 243 9cba6edbfc43ef7cc1ee121d6bafe5b37cd98f1a NuGet.DDPClient/NuGet.exe
    556 235 324daa842c51a9936d8baeabbc21da1e01bc9a4d NuGet.DDPClient/NuGet.exe
    0 0 42736bfcf12ffbb6310925cdb0324c5fa93acfeb Net.DDP.Client/JsonDeserializeHelper.cs
    1 0 f9ebe862cb66aa4c15aee45d8379c63e5012f289 Net.DDP.Client.sln
    0 0 a8283d68b146511c97d2efdea83ebe9ec7647a06 NuGet.DDPClient/Properties/AssemblyInfo.cs
    1 0 ef530adfb46c2490349ae32c512a58f15ef0e93b Net.DDP.Client.sln
    1 0 3d78ffce0172f51e71a78bf78dd7af46846c0c65 Net.DDP.Client.sln

    After fix:

    All sizes are in kB’s. The pack column is the size of the object, compressed, inside the pack file.
    size pack SHA location
    1625 510 8dd7e45ae75d1a55fc669f09bdef4a49b16a95dd NuGet.DDPClient/NuGet.exe
    1183 1178 fb831edc88ab20ee23e4e5cae28245bd183e32e3 packages/Newtonsoft.Json.4.5.9/Newtonsoft.Json.4.5.9.nupkg

    Comment by Stéphane Gourichon — January 9, 2017 @ 9:04 am

  18. […] have also tried this nice idea, but it does not show SHAs or any other lead on where to cut the […]

    Pingback by git gc on bare repository does not clean up – program faq — January 23, 2018 @ 8:07 pm

  19. […] Download the script from this site. The contents of the script are as follows. […]

    Pingback by うっかりコミットしてしまった余分なファイルを.gitから削除する方法 — February 4, 2018 @ 2:42 pm

  20. […] Git Script to Show Largest Pack Objects and Trim Your Waist Line! […]

    Pingback by git Single Branch clone ha crecido – cómo devolverlo al tamaño original Git & Github — April 27, 2018 @ 4:57 am

  21. […] Git Script to Show Largest Pack Objects and Trim Your Waist Line! […]

    Pingback by GIT Single Branch клон вырос – как вернуть его в исходный D — January 11, 2019 @ 3:07 pm

  23. […] Git Script to Show Largest Pack Objects and Trim Your Waist Line! […]

    Pingback by Найти фиксацию, когда файл был добавлен во &# — March 18, 2019 @ 10:25 am

  24. Hello Tony,

    Thanks, your script helped me and to thank you here is my contribution (http://saksoook.blogspot.com/2021/07/i-am-contributing-back-some.html)

    Thanks,

    Comment by Khamis Siksek — July 12, 2021 @ 1:40 am

  25. […] is due to Antony Stubbs here – his Bash script identifies the largest files in a local Git repository, and is reproduced […]

    Pingback by How to shrink the .git folder - The Citrus Report — February 28, 2023 @ 8:17 am

  26. […] is due to Antony Stubbs here – his Bash script identifies the largest files in a local Git repository, and is reproduced […]

    Pingback by [Git] How to shrink the .git folder - Pixorix — May 29, 2023 @ 8:32 am

