new post - beauty of unix pipelines
Prithu Goswami <prithugoswami524@gmail.com>
Wed, 13 May 2020 14:10:26 +0530
---
title: "The beauty of Unix pipelines"
date: 2020-02-02T17:45:30+05:30
description: "Some examples of using unix tools in a pipeline"
tags:
- unix
- command line
- scripts
---

The Unix philosophy lays emphasis on building software that is simple and
extensible. Each piece of software must do one thing and do it well. And that
software should be able to work with other programs through a common interface
-- a text stream. This is one of the core philosophies of Unix which makes it
so powerful and intuitive to use.

This is an excerpt from [The Unix Programming
Environment](https://en.wikipedia.org/wiki/The_UNIX_Programming_Environment):

> Even though the UNIX system introduces a number of innovative programs and
> techniques, no single program or idea makes it work well. Instead, what makes
> it effective is the approach to programming, a philosophy of using the
> computer. Although that philosophy can't be written down in a single sentence,
> at its heart is the idea that the power of a system comes more from the
> relationships among programs than from the programs themselves. Many UNIX
> programs do quite trivial things in isolation, but, combined with other
> programs, become general and useful tools.

I think that explains it pretty well, but in this post I would like to show
some examples of how you can use Unix pipelines to accomplish tasks.

Examples:
- Printing a leaderboard of authors based on number of commits to a git repo
- Browse memes from [/r/memes](https://reddit.com/r/memes) and set your wallpaper from [/r/earthporn](https://reddit.com/r/earthporn)
- Get a random movie from an IMDb list

## Example 1 - Printing a leaderboard of authors based on number of commits in a git repo

Let's start with a simple one -- display a list of authors/contributors of a
git repo, sorted by their number of commits in descending order (most commits
at the top). This is a simple task when you think of it in terms of pipelines.
`git log` is used to display commit logs. We can pass the `--format=<format>`
option to it to specify what format we want the commits to be displayed in.
`--format='%an'` just prints the author's name for each commit.

```bash
$ git log --format='%an'

Alice
Bob
Denise
Denise
Candice
Denise
Alice
Alice
Alice
```

Now we can use the `sort` utility to sort them alphabetically.

```bash
$ git log --format='%an' | sort

Alice
Alice
Alice
Alice
Bob
Candice
Denise
Denise
Denise
```

Next we use `uniq`:

```bash
$ git log --format='%an' | sort | uniq -c

  4 Alice
  1 Bob
  1 Candice
  3 Denise
```

According to `uniq`'s man page:

> **uniq** - report or omit repeated lines
>
> Filter adjacent matching lines from INPUT (or standard input), writing to
> OUTPUT (or standard output).

So `uniq` collapses repeated lines, but only those that appear _adjacent to
each other_. That is why we had to pass the output through `sort` first. The
`-c` flag prefixes each line with the number of occurrences.

You can see the output is still sorted alphabetically, so all that remains is
to sort it numerically. `sort` has a flag for that too: `-n`, which compares
lines by their numerical value.

```bash
$ git log --format='%an' | sort | uniq -c | sort -nr

  4 Alice
  3 Denise
  1 Candice
  1 Bob
```

The `-r` flag was also included to print the list in reverse order, since by
default `sort` sorts in ascending order. And there you have it -- a list of
authors sorted by number of commits.
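If you find yourself doing this often, the pipeline fits neatly into a tiny
shell function. A minimal sketch (the name `git-leaderboard` is made up for
illustration):

```bash
# Drop this in your ~/.bashrc or ~/.zshrc; the function name is arbitrary.
git-leaderboard() {
    # Commit count per author, most prolific contributors first.
    git log --format='%an' | sort | uniq -c | sort -nr
}
```

Incidentally, git ships a built-in that does roughly the same thing: `git
shortlog -s -n` prints a commit count per author, sorted by count. But
building it yourself out of `sort` and `uniq` is a nice illustration of the
philosophy.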
## Example 2 - Browse memes from [/r/memes](https://reddit.com/r/memes) and set your wallpaper from [/r/earthporn](https://reddit.com/r/earthporn)

Did you know that you can just append "`.json`" to a reddit url to get a JSON
response instead of the usual HTML? This allows for a world of possibilities!
One such is browsing memes right from the command line (well, not entirely,
because the actual image will be displayed by a GUI program). We can simply
curl or wget the url -- https://reddit.com/r/memes.json

```bash
$ wget -O - -q 'https://reddit.com/r/memes.json'

'{"kind": "Listing", "data": {"modhash": "xyloiccqgm649f320569f4efb427cdcbd89e68aeceeda8fe1a", "dist": 27, "children":
[{"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "memes",
"selftext": "More info available at....'
...
...
More lines
...
...

```

I use wget here because it seems like curl's User-Agent gets treated
differently. You can get around this by simply changing the 'User-Agent'
header, but I just went with `wget`. Wget has a `-O` option to provide the
output filename. Most programs that take such an option also accept a value of
`-`, which represents standard output or input depending on the context. The
`-q` option just tells wget to be quiet and not print things like progress
status. Now we have a big JSON structure to work with. To parse and use this
JSON data meaningfully on the command line, we can use
[`jq`](https://stedolan.github.io/jq/). `jq` can be thought of as `sed`/`awk`
for JSON. It has a simple, intuitive language of its own, which you can refer
to in its man page.

If you take a look at the response JSON, it looks something like this:

```json
{
  "kind": "Listing",
  "data": {
    "modhash": "awe40m26lde06517c260e2071117e208f8c9b5b29e1da12bf7",
    "dist": 27,
    "children": [],
    "after": "t3_gi892x",
    "before": null
  }
}
```

So here we have some response of the type "Listing" and we can see we have an
array of "children". Each element of that array is a post.

This is what one of the elements of the 'children' array looks like:

```json
{
  "kind": "t3",
  "data": {
    "subreddit": "memes",
    "selftext": "",
    "created": 1589309289,
    "author_fullname": "t2_4amm4a5w",
    "gilded": 0,
    "title": "Its hard to argue with his assessment",
    "subreddit_name_prefixed": "r/memes",
    "downs": 0,
    "hide_score": false,
    "name": "t3_gi8wkj",
    "quarantine": false,
    "permalink": "/r/memes/comments/gi8wkj/its_hard_to_argue_with_his_assessment/",
    "url": "https://i.redd.it/6vi05eobdby41.jpg",
    "upvote_ratio": 0.93,
    "subreddit_type": "public",
    "ups": 11367,
    "total_awards_received": 0,
    "score": 11367,
    "author_premium": false,
    "thumbnail": "https://b.thumbs.redditmedia.com/QZt8_SBJDdKLVnXK8P4Wr_02ALEhGoGFEeNhpsyIfvw.jpg",
    "gildings": {},
    "post_hint": "image",

    ".................."
    "more lines skipped"
    ".................."
  }
}
```

I have reduced the number of key-value pairs in `data`; in total there were
105 items. As you can see, there are many interesting data attributes you can
fetch about a post. The one of interest to us is `url`. This isn't the url of
the actual reddit post but rather the url of the content of the post. (If the
post url is what you want, that's `permalink`.) So in this case, the `url`
field is the url of the meme's image.
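Before grabbing every URL in bulk, it can help to poke at the structure
interactively. A small sketch using jq's object-construction shorthand
(`{title, score}` builds an object out of those two fields; the field names
come from the response above):

```bash
# Pull just the title and score of the first post in the listing.
$ wget -O - -q 'https://reddit.com/r/memes.json' |
    jq '.data.children[0].data | {title, score}'
```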
We can simply get a list of the urls of every post using:

```bash
$ wget -O - -q reddit.com/r/memes.json | jq '.data.children[] | .data.url'

"https://www.reddit.com/r/memes/comments/g9w9bv/join_the_unofficial_redditmc_minecraft_server_at/"
"https://www.reddit.com/r/memes/comments/ggsomm/10_million_subscriber_event/"
"https://i.imgur.com/KpwIuSO.png"
"https://i.redd.it/ey1f7ksrtay41.jpg"
"https://i.redd.it/is3cckgbeby41.png"
"https://i.redd.it/4pfwbtqsaby41.jpg"
...
...
```

Ignore the first two links; those are sticky posts pinned by the mods, whose
'url' is the same as their 'permalink'.

`jq` reads from standard input, and it's fed the JSON we saw earlier.
`.data.children` refers to the array of posts mentioned earlier, and
`.data.children[] | .data.url` means "iterate through every element in the
array and print the 'url' field inside the 'data' field of each element".

So we get a list of the urls of all the "hot" posts on
[/r/memes](https://reddit.com/r/memes). If you want the "top" posts of this
week, you can hit https://reddit.com/r/memes/top.json?t=week. Top posts of all
time? `t=all`. Of the year? `t=year`, and so on.

Once we have the list of URLs, we can just pipe it into `xargs`, a really
useful utility that builds command lines from standard input. This is what
xargs's man page says:

> xargs reads items from the standard input, delimited by blanks (which can be
> protected with double or single quotes or a backslash) or newlines, and
> executes the command (default is /bin/echo) one or more times with any
> initial-arguments followed by items read from standard input. Blank lines on
> the standard input are ignored.

So running something like:

```bash
$ echo "https://i.redd.it/4pfwbtqsaby41.jpg" | xargs wget -O meme.jpg -q
```

is equivalent to running:

```bash
$ wget -O meme.jpg -q "https://i.redd.it/4pfwbtqsaby41.jpg"
```

Now we can just pass the list of URLs to an image viewer like
[`feh`](https://feh.finalrewind.org/) or
[`eog`](https://wiki.gnome.org/Apps/EyeOfGnome) that accepts URLs as valid
arguments.

```bash
$ wget -O - -q reddit.com/r/memes.json | jq '.data.children[] | .data.url' | xargs feh
```

feh pops up with the memes, and I can browse through them using the arrow keys
as if they were on my local disk.

{{< figure src="feh-meme2.png" title="Feh screen" width="100%" >}}

Or I could simply download all of the images by replacing `feh` with `wget`
above.

The possibilities are endless. Another good use of this reddit JSON data is
**setting the wallpaper** of your desktop to the top upvoted image of
[/r/earthporn](https://reddit.com/r/earthporn) from the "hot" section.

```bash
$ wget -O - -q reddit.com/r/earthporn.json | jq '.data.children[] | .data.url' | head -1 | xargs feh --bg-fill
```

I use the `head` command here to print just the first line, which is the top
upvoted post. On its own, `head` seems to do something trivial and not very
useful, but working with other programs here, it becomes an important part of
the pipeline. You can then, if you want, set this up as a cron job that runs
every hour or so.
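A sketch of what that crontab entry could look like. I'm assuming a desktop
session on display `:0` here -- cron jobs don't inherit your X environment, so
`DISPLAY` has to be set explicitly (and depending on your setup, `XAUTHORITY`
may be needed too):

```bash
# Run `crontab -e` and add a line like this to refresh the wallpaper hourly.
# jq's -r flag prints the raw string without the JSON quotes, and
# .data.children[0] picks the first (top) post, replacing `head -1`.
0 * * * * DISPLAY=:0 wget -O - -q https://reddit.com/r/earthporn.json | jq -r '.data.children[0].data.url' | xargs feh --bg-fill
```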
You see the power of Unix pipelines? That one single line does everything from
fetching the JSON data and parsing the relevant bits out of it, to fetching
the image at the URL and finally setting it as the wallpaper.

Another silly thing I used this for was just downloading memes off of /r/memes
every two hours. So now I have around 19566 memes taking up 4.5G on my disk.
Why did I do that? Don't ask me...


## Example 3 - Get a random movie from an IMDb list

Let's end with a simple one. IMDb has a feature where they allow you to make
lists. You can also find lists made by other users. For example -- [Blow Your
Mind Movies](https://www.imdb.com/list/ls020046354). If you append `/export`
to the url you get the list in `.csv` format.

```bash
$ curl https://www.imdb.com/list/ls020046354/export

Position,Const,Created,Modified,Description,Title,URL,Title Type,IMDb Rating,Runtime (mins),Year,Genres,Num Votes,Release Date,Directors
1,tt0137523,2017-07-30,2017-07-30,,Fight Club,https://www.imdb.com/title/tt0137523/,movie,8.8,139,1999,Drama,1780706,1999-09-10,David Fincher
2,tt0945513,2017-07-30,2017-07-30,,Source Code,https://www.imdb.com/title/tt0945513/,movie,7.5,93,2011,"Action, Drama, Mystery, Sci-Fi, Thriller",471234,2011-03-11,Duncan Jones
3,tt0482571,2017-07-30,2017-07-30,,The Prestige,https://www.imdb.com/title/tt0482571/,movie,8.5,130,2006,"Drama, Mystery, Sci-Fi, Thriller",1133548,2006-10-17,Christopher Nolan
4,tt0209144,2018-01-16,2018-01-16,,Memento,https://www.imdb.com/title/tt0209144/,movie,8.4,113,2000,"Mystery, Thriller",1081848,2000-09-05,Christopher Nolan
5,tt0144084,2018-01-16,2018-01-16,,American Psycho,https://www.imdb.com/title/tt0144084/,movie,7.6,101,2000,"Comedy, Crime, Drama",462984,2000-01-21,Mary Harron
6,tt0364569,2018-01-16,2018-01-16,,Oldeuboi,https://www.imdb.com/title/tt0364569/,movie,8.4,120,2003,"Action, Drama, Mystery, Thriller",491476,2003-11-21,Chan-wook Park
7,tt1130884,2018-10-08,2018-10-08,,Shutter Island,https://www.imdb.com/title/tt1130884/,movie,8.1,138,2010,"Mystery, Thriller",1075524,2010-02-13,Martin Scorsese
8,tt8772262,2019-12-27,2019-12-27,,Midsommar,https://www.imdb.com/title/tt8772262/,movie,7.1,148,2019,"Drama, Horror, Mystery, Thriller",150798,2019-06-24,Ari Aster
```

We can use `cut` to decide which fields to print:

```bash
$ curl https://www.imdb.com/list/ls020046354/export | cut -d ',' -f 6

Title
Fight Club
Source Code
The Prestige
Memento
American Psycho
Oldeuboi
Shutter Island
Midsommar
```

The `-d` option specifies the delimiter between fields -- in this case a comma
(`,`). The `-f` option is the field number you want to print; here the sixth
field is the title of the movie. This also prints the csv header "Title", so
to remove it we can use `sed '1 d'`, which just means "**d**elete line **1**
from the input stream".

We can then pipe the list of movies into `shuf`, which shuffles its input
lines randomly and spits them out.

```bash
$ curl https://www.imdb.com/list/ls020046354/export | cut -d ',' -f 6 | sed '1 d' | shuf

American Psycho
Midsommar
Source Code
Oldeuboi
Fight Club
Memento
Shutter Island
The Prestige
```
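As an aside, GNU `shuf` can do the final selection on its own: `shuf -n 1`
limits the output to a single randomly chosen line, making the `head -1` step
shown next optional. A sketch:

```bash
# -n 1 makes shuf emit one randomly picked line instead of a full shuffle.
$ curl -s https://www.imdb.com/list/ls020046354/export | cut -d ',' -f 6 | sed '1 d' | shuf -n 1
```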
Now just pipe it into `head -1` or `sed '1 q'`, both of which print only the
first line. Every time you run this, you should get a random selection.

```bash
$ curl https://www.imdb.com/list/ls020046354/export | cut -d ',' -f 6 | sed '1 d' | shuf | head -1

Source Code
```

Now let's say you would also like the URL to be printed along with the title.
No problem -- `cut` allows you to specify multiple fields to print using
`--fields=LIST` (the long form of `-f`):

```bash
$ curl https://www.imdb.com/list/ls020046354/export | cut -d ',' --fields=6,7 | sed '1 d' | shuf | head -1

Shutter Island,https://www.imdb.com/title/tt1130884/
```

There is a problem with this, though: `cut` doesn't understand CSV quoting, so
if a movie title has a comma in it, you would get a totally different field
value. One way to overcome this is with a Python one-liner like this:

```bash
python -c 'import csv,sys; [print(a["Title"]) for a in csv.DictReader(sys.stdin)]'
```

```bash
$ curl -s https://www.imdb.com/list/ls020046354/export |\
  python -c 'import csv,sys; [print(a["Title"], a["URL"]) for a in csv.DictReader(sys.stdin)]' |\
  shuf | head -1

Oldeuboi https://www.imdb.com/title/tt0364569/
```

These were just a few examples; there are so many things you can accomplish in
a single line of shell using pipes.