Chunk-A-File: Two Ways To Split A File In Linux
10/21/2014 02:44:00 AM
Jerome Christopher
Posted in Linux, Shell Scripts, Tips'nTricks
Today's generations have it all easy. Even a monkey can take a more awesome picture than you can. Chips have gotten smaller and smaller (computer chips, not potato chips!) but better and better. Memory once measured in bytes and kilobytes now gets thrown around in gigabytes and terabytes on the desktop, and the primitive, bland websites of the pre-Y2K era have become dizzyingly attractive, versatile and complex.
Alright, now back to file-splitting. The basic [split] command has actually been around since the early days of Unix, but handy options like [--number] only arrived much later with GNU coreutils. Without them, one would probably have a "splitting headache" just thinking about how to accomplish such splits! Today you can, for example, use the [split --number=10 words.out] command to divide the words.out file into 10 equal chunks by size, and the [split --lines=9917 words.out] command to split it into 10 equal parts by line count. However, let's explore two home-cooked scripts that I came up with, using the [head] and [tail] commands, to split a Linux text file: the first splits it into 10 equal parts, and the second splits it to its tenths, both using line counts rather than file size to accomplish the division.
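For reference, here is what those two invocations look like on the command line (assuming a words.out of 99170 lines; [split] writes its output parts to xaa, xab and so on by default):

    # ten chunks of roughly equal byte size
    split --number=10 words.out

    # parts of 9917 lines each: 10 parts when the input has 99170 lines
    split --lines=9917 words.out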
PREP WORK - CREATE INPUT DATA FILE
PREP 1. We'll use a copy of the [/usr/share/dict/words] file as the input file.
To find the line count needed to divide the file into 10 parts, first find the total number of lines, then divide it by 10, as sketched below.
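Something like this does the job (a minimal sketch; the 9917 figure used throughout this post implies a 99170-line copy of the dictionary, and yours may differ):

    # work on a copy of the dictionary file, then size up the parts
    cp /usr/share/dict/words words.out

    total=$(wc -l < words.out)   # total number of lines
    chunk=$(( total / 10 ))      # lines per part, e.g. 99170 / 10 = 9917
    echo "$total lines total, $chunk lines per part"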
PREP 2. If you find the [/usr/share/dict/words] file missing, don't worry; we'll create another one [wordfile.out] on the fly.
We'll create a dummy file and populate it with the output of the [ls] command run on the /bin, /usr/bin and /usr/local/bin folders, adding a [.tmp] suffix to every file name for extra safety; a sketch of such a command follows.
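A minimal sketch, assuming plain file names (the exact pipeline is an assumption; any command that lists the three folders and tacks on the suffix will do):

    # build wordfile.out from the file names in three bin directories,
    # adding a .tmp suffix to every name so none of them clash with real files
    for d in /bin /usr/bin /usr/local/bin; do
        ls "$d"
    done | sed 's/$/.tmp/' > wordfile.out

    wc -l wordfile.out   # count the lines we generated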
Good. We see that we have generated 2238 lines for our input file [wordfile.out]. Though we have created two input files [words.out and wordfile.out], we'll use the [words.out] file for this exercise. Now, to display the line counts for the tenths, use a oneliner along the lines of the sketch below; it calculates the total and the tenths in one pass. Remember, the numbers denote line numbers in the input file.
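Here is one way to write such a oneliner (an awk-based sketch; the original could equally be a shell loop):

    # one pass over words.out: total line count, then each tenth's line number
    awk 'END { print "total:", NR; for (i = 1; i <= 10; i++) print i "/10 -> line", int(NR * i / 10) }' words.out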
SCRIPT 1: Script to divide the file into 10 equal parts by line count.
Ok. Here's the first script [ChunkAFile.bash] that divides the input file into 10 equal parts by line count; a sketch of it appears below. We run it against [words.out] and then display the output files [t01, t02 through t10].
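The following is a minimal sketch of what such a script can look like, built on [head] and [tail] as described (the argument handling and the remainder going to the last part are assumptions on my part):

    #!/bin/bash
    # ChunkAFile.bash - divide a text file into 10 equal parts by line count,
    # written to t01 through t10, using only head and tail

    infile="${1:-words.out}"            # input file, defaulting to words.out
    total=$(wc -l < "$infile")
    chunk=$(( total / 10 ))             # lines per part

    for i in $(seq 1 10); do
        out=$(printf "t%02d" "$i")      # t01, t02, ... t10
        if [ "$i" -lt 10 ]; then
            # take the first chunk*i lines, keep the last chunk of them
            head -n $(( chunk * i )) "$infile" | tail -n "$chunk" > "$out"
        else
            # part 10 takes everything after line chunk*9, remainder included
            tail -n +$(( chunk * 9 + 1 )) "$infile" > "$out"
        fi
    done

Run it as [./ChunkAFile.bash words.out] and the t01 through t10 files land in the current directory.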
Let's check the word counts [l(lines); m(characters); c(bytes)] of the output files. Notice that the line count is the same [9917] for all 10 files in the first column of the listing. This verifies that our ten output files are equal by line count.
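The check is a single [wc] call:

    # lines, characters and bytes for each part;
    # the first column should read 9917 all the way down
    wc -lmc t0? t10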
SCRIPT 2: Script to divide the file to its tenths by line count.
Let's now examine the second script [ChunkAFileTenth.bash], which divides the input file to its tenths by line count; a sketch of it follows. Same as with script 1, we run it and then display the output files [01-tenth.out through 10-tenth.out]. See how their sizes increase with every tenth.
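Here is a minimal sketch of such a script; since every slice starts at line 1, [head] alone does the work (again, the argument handling is an assumption):

    #!/bin/bash
    # ChunkAFileTenth.bash - write the first 1/10th, 2/10ths, ... 10/10ths of
    # the input file's lines to 01-tenth.out through 10-tenth.out

    infile="${1:-words.out}"            # input file, defaulting to words.out
    total=$(wc -l < "$infile")

    for i in $(seq 1 10); do
        # cumulative slice: everything up to the i-th tenth of the file
        head -n $(( total * i / 10 )) "$infile" > "$(printf "%02d-tenth.out" "$i")"
    done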
Finally, let's also verify the counts [l(lines); m(characters); c(bytes)].
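One more [wc] call does it:

    # lines, characters and bytes for each cumulative tenth;
    # every count should grow from one file to the next
    wc -lmc ??-tenth.out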
THAT'S ALL FOR NOW!
Okay. We are through with our two scripts! We have come up with new ways to split a file into ten equal parts and to split it to its tenths, and we have verified our results with line counts. Remember, we can use the same technique to split a file to its hundredths [1/100th(1%), 2/100th(2%) through 100/100th(100%)]; see the one-line tweak below. Hopefully, our two little scripts will inspire us to come up with newer and better ways to accomplish similar familiar tasks, all the while looking to improve stuff at the same time.
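For instance, swapping the 10s for 100s in the tenths loop is all it takes (a sketch reusing the [infile] and [total] variables from the [ChunkAFileTenth.bash] sketch above):

    # same idea, just 100 cumulative slices instead of 10
    for i in $(seq 1 100); do
        head -n $(( total * i / 100 )) "$infile" > "$(printf "%03d-hundredth.out" "$i")"
    done

Remember to visit sqlhtm.com for the latest tweet/Bible verse viewer "Polar Verses" and more tools!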