# Basic bash for NLP tutorial

Bash is a language that allows for easy manipulation of Unix-like computing environments. In NLP research, quick, simple tasks like “how many lines are there in this file” or “how can I copy a file to 100 places at the same time” come up frequently, and are often best solved with bash and a few easy-to-use companion programs.

Beyond this, Bash is much more versatile and useful than many give it credit for, and advanced knowledge of bash can make projects involving very large systems and/or multi-step pipelines truly a joy.

## When to use bash; when not to

I love bash dearly. That’s not to say it’s the answer to every problem in NLP.

### Use bash when

• You need to move and/or rename one or many files
• You need to set up folder structures, create symbolic links (more on this later), and generally change the file system.
• You want basic statistics on one or many files (e.g. “I have 3,000,000 lines of log files, and if the word ‘error’ appears, something failed. Did it ever appear?”)
• You want to submit multiple jobs to the scheduler with similar parameters
• You want to set up a “pipeline”, which means a set of processing steps that takes input files and produces output files, using multiple independent programs.
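As a small illustration of the “basic statistics” case above, here is a sketch that counts lines and checks for the word “error”. The log files and their contents are made up for the example and are written to a scratch directory; `wc` and `grep` are standard companion programs.

```bash
# Hypothetical log files, created in a scratch directory so the example is self-contained.
workdir=$(mktemp -d)
printf 'step 1 ok\nerror: out of memory\nstep 2 ok\n' > "$workdir/run1.log"
printf 'step 1 ok\nstep 2 ok\n' > "$workdir/run2.log"

# wc -l counts lines; cat joins the files first so we get a single total.
total_lines=$(cat "$workdir"/*.log | wc -l)

# grep -c prints how many lines match the pattern.
error_lines=$(cat "$workdir"/*.log | grep -c 'error')

echo "total lines: $total_lines"
echo "lines containing 'error': $error_lines"
```

If all you need is a yes/no answer, `grep -q 'error' file` prints nothing and reports the result through its exit status.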

### Don’t use bash when

• You need to do a lot of arithmetic computation, especially division/multiplication
• You need library support (for, say, reading TSVs, or JSON, etc.)
• You plan to make copious use of fun data structures like sets, counters, UnionFind, MyArbitraryClass, or TrieHeapMapBellmanFordImpl
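To make the arithmetic point concrete, here is a minimal sketch. Bash's built-in arithmetic works on integers only, so division silently drops the fractional part. Handing the expression to `awk` is one common workaround, chosen here only as an example.

```bash
# Bash arithmetic is integer-only: the fractional part is dropped.
half=$((7 / 2))
echo "7 / 2 in bash: $half"

# One workaround: let another program (here awk) do the floating-point math.
precise=$(awk 'BEGIN { printf "%.2f", 7 / 2 }')
echo "7 / 2 via awk: $precise"
```

Once you find yourself doing this more than a couple of times in a script, that is usually the sign to switch languages.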

## File system and unix basics

As we said in “Use bash when”, we frequently use bash to manipulate a Unix file system. The Unix filesystem is a rooted, tree-like data structure. Its root is /, and this forward slash is known as “root”. Below is a diagram showing a selection of the root of the nlpgrid filesystem(s).

When using bash, you are always “at” one of the nodes in the filesystem tree. Every command you run is run relative to your current node, or “folder”, or more commonly, “directory.” A “path” is a list of names of folders combined with forward slashes (/) that takes you either from root (an absolute path) or from any arbitrary directory (a relative path) to some other directory.

### Knowing where you are

In bash commands, the period (.) refers to your current directory. When reading or writing files, the . is implicit. Let’s demonstrate this with a file. Open up a terminal and run

    touch testfile


This creates a file named testfile if it doesn’t exist, or updates the file’s access and modification timestamps if it already exists. Now, list the contents of your current directory with

    ls


You should see (possibly among other things) the file testfile. You can demonstrate that this is equivalent to:

    touch ./testfile
    ls .


In both cases, the . was implicit.

However, when specifying commands, the . is not implicit. For example, try running

    ./touch testfile


You should find that bash complains that there is no such file or directory. This is because you told bash “look for the command touch; it’ll be in this directory,” and there is no program named touch in your current directory. Without the prepended ./, the command touch is found by other means, which we’ll cover later.
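If you are curious in the meantime, the sketch below reproduces the failure and then asks bash where it actually found touch. It runs in a scratch directory; `command -v` is a standard shell builtin that prints how a command name would be resolved.

```bash
cd "$(mktemp -d)"            # scratch directory that contains no program named touch

./touch testfile 2>/dev/null # fails: bash only looks in the current directory
status=$?
echo "exit status of ./touch: $status"   # non-zero, because the command failed

# Ask bash where the plain command name resolves to.
touch_path=$(command -v touch)
echo "touch lives at: $touch_path"
```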

Specifying the location of files you want to work with can be more complex than just looking in your current directory, however. For example, make a directory within your current directory with

    mkdir testfolder


Now, make a new file in that directory, and demonstrate that the file is in the directory, with

    touch testfolder/testfile2
    ls testfolder


You can also change your working directory, which changes the location from which your commands are run. For example, if we run

    cd testfolder
    ls


We should see testfile2, not testfile. To return to the directory one level “above” the current directory (that is, closer to /), run

    cd ..


Here, .. is short for “parent directory”.

Here are some other quick commands you’d otherwise have to search for at some point:

• cd - sends you to the last directory you were in, whatever that was, and prints the name of that directory.
• ~ the tilde represents your “home” directory. On my laptop, for example, that’s /home/johnhew. If you run a command with the tilde, it will be replaced by the location of your home directory.
• * the asterisk stands for “everything within the directory I’ve just specified” (hidden files excepted). Bash expands it into a list of names before the command runs. Thus ls * means “list every file in the current directory, plus the contents of every directory in it.”
• pwd is a command that prints the absolute path of the current working directory.
• tree is a fun command that lists the tree structure starting from the current working directory and searching “down” (away from /).
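Here is a short sketch that exercises several of these in one go. It rebuilds the testfile/testfolder layout from earlier inside a scratch directory, so it won’t touch anything in your real working directory.

```bash
start=$(mktemp -d)        # scratch directory (an absolute path)
cd "$start"
mkdir testfolder
touch testfile testfolder/testfile2

cd testfolder
pwd                       # prints an absolute path ending in /testfolder
cd - > /dev/null          # jump back; the printed directory name is silenced here
back_in=$(pwd)

# The shell expands * before the command runs, so echo (or ls) just sees a list of names.
expanded=$(echo *)
echo "$expanded"

# ~ is replaced by your home directory, normally the same value as $HOME.
echo ~
```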

### File copying and movement

To copy a file located at oldloc to a new location newloc, use the cp command as follows:

    cp oldloc newloc


Seems simple enough. oldloc and newloc can each be an absolute path, like /nlp/data/johnhew/expts/realgooddata.txt.png, or a relative path like realgooddata.txt.png. The semantics of cp allow you to copy multiple files to one location, but not one file to multiple locations. By this I mean that when you run:

    cp loc1 loc2 loc3 loc4 loc5


You’re copying loc1 through loc4 all into loc5. If loc5 is a directory and loc1, loc2, loc3, and loc4 all have unique filenames, all 4 files will be copied into the loc5 directory. If loc5 is a regular file (or doesn’t exist), cp exits with an error and copies nothing, because with more than one source the last argument must be a directory.
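The sketch below shows both directions: many files into one directory with a single cp, and one file into many directories with a loop, since cp has no syntax for that. The names dest, place1, place2, and place3 are made up for the example.

```bash
work=$(mktemp -d)         # scratch directory
cd "$work"
touch loc1 loc2 loc3
mkdir dest

# Several sources, one destination directory: each file lands inside dest/.
cp loc1 loc2 loc3 dest
copied=$(ls dest | wc -l)
echo "files in dest: $copied"

# One source, several destinations: loop over the targets instead.
mkdir place1 place2 place3
for d in place1 place2 place3; do
  cp loc1 "$d"/
done
```

The same loop scales to the “copy a file to 100 places” case: only the list after `in` changes.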

If you don’t want to keep the old file locations, use the mv (move) command instead of cp.
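A minimal sketch of mv, with made-up file names, showing that renaming and moving are the same operation:

```bash
work=$(mktemp -d)         # scratch directory
cd "$work"
mkdir archive
touch draft.txt

mv draft.txt final.txt    # rename: the old name no longer exists
mv final.txt archive/     # move into a directory, keeping the file name
ls archive
```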

Moving files is great, but actually working with file contents is great too.

### An optional aside: vim

To integrate file editing into your bash workflow, I highly suggest putting in the time to learn vim, the world’s best text editor. There’s a bit of a learning curve, but many things are googleable, and I’ll start you off with two things:

• Use :q to quit.
• Here’s a good starter vimrc, which is a configuration file. Google for more info.

    filetype indent on
    filetype plugin on
    set expandtab
    set mouse=a
    set pastetoggle=<F2>
    syntax enable
    set tabstop=2
    set shiftwidth=2
    set softtabstop=2
    set autoindent
    set whichwrap+=<,>,h,l,[,]
    set cursorline
    set number
    set hlsearch
    set lazyredraw
    set omnifunc=syntaxcomplete#Complete

## Epilogue: compression

If you find files with the extension .tgz, .tar.bz2, or .tar.gz, never fear. These are your friends in the Unix environment.

A tar file is an archive; that is, someone took a directory or a bunch of files, and stuck them together into one file for ease of transport. A gzipped or bzipped file is a compressed file; that is, someone used awesome compression algorithms to make a single file take up less space on disk. These strategies are frequently used together, leading to the file extensions seen above. Let’s look at what to do:

• “I have a .tar.gz, and I want to get what’s inside it”

    tar xzvf file.tar.gz

• “I have a folder and/or some files (file1, directory3, etc.), and I want to make them into an archive”

    tar czvf file.tar.gz file1 file2 directory3 directory4

• “I want to know what xzvf and czvf mean”. Good question. -x means “extract” (i.e., from the archive). -c means “create” (i.e., build a new archive from the files listed). -z means “use gzip compression”. -v means “be verbose; that is, tell me what files you’re working on compressing/extracting right now by printing their paths.” -f means “I’ll specify the file that you’re going to work with, and it’s coming up next.”
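To see the pieces fit together, here is a round-trip sketch in a scratch directory. The file names and contents are invented for the example. It also uses two flags not covered above: -t, which lists an archive’s contents without extracting, and -C, which tells tar to extract into a given directory. The -v flag is left out to keep the output short.

```bash
work=$(mktemp -d)         # scratch directory
cd "$work"
mkdir directory3
echo "hello" > file1
echo "world" > directory3/notes.txt

# c = create, z = gzip, f = archive name comes next
tar czf file.tar.gz file1 directory3

# t = list what is inside, without extracting anything
listing=$(tar tzf file.tar.gz)
echo "$listing"

# x = extract; -C picks the directory to extract into (it must already exist)
mkdir unpacked
tar xzf file.tar.gz -C unpacked
```

For a .tar.bz2 archive, swap the z for a j, which selects bzip2 instead of gzip.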