Infinite Undo!

Hello, I'm Noah Sussman

Oct 6th, Mon

Convert CSV to JSON with jq

jq, the JSON processor, is a new tool for command-line data mining. I’ve been using jq for a couple of months and it’s super useful.

In this post I present a simple example of converting CSV to JSON using jq. Parsing CSV with jq is important because while lots of tools will read a CSV file, most of those tools (e.g. Excel, Google Docs, your text editor) are not easy to use in automation scripts.

For example purposes, I am using sdss1738478.csv — a 200MB (about 1.7 million lines) CSV data set containing stellar object coordinates from the Sloan Digital Sky Survey, downloaded from the Real-World Data Sets archive at Princeton.

Download sdss1738478.csv (200MB)

Step 1: Visual examination of the raw data

Having downloaded sdss1738478.csv, the first thing I want to do is take a look at the first few lines of the file to get a sense of what I am dealing with. For one thing I expect that the first line of the file will be a list of column headings which I can use to structure the JSON output I’m going to generate.

Use head

I use the head command to examine the column header row plus a couple of lines of data. Seeing this helps me enormously to design the command that will convert this file to JSON.

$ head -n3 sdss1738478.csv

objectid (long),right ascension (float),declination (float),ultraviolet (double),green (double),red$
758882758529188302,263.4087,6.278961,24.5967464447021,23.471773147583,21.6918754577637,21.1872177124023,20.4352779388428
758882758529188203,263.4428,6.38233,26.7489776611328,25.1870555877686,21.7422962188721,22.282844543457,22.1947193145752

head is also useful here because sdss1738478.csv is over a million lines long. Using head allows me to abstract away the complexity of working with a very large file and just concentrate on figuring out the column headings and the data format.

Step 2: Define an output format

Now that I have a sense of what the raw data looks like I can sketch out what I’d like the finished data set to look like — once it has been converted into JSON.

The simplest thing would be to deliver an array of JSON records (one for each line) containing labeled data fields.

The column headings obviously should be used as the data labels, but I’m going to edit them slightly so that they make convenient keys for a JSON object.

Here’s some pseudocode to illustrate the format I’ve designed:

[
   {
      "id"          : <data1>,
      "ascension"   : <data2>,
      "declination" : <data3>,
      "ultraviolet" : <data4>,
      "green"       : <data5>,
      "red"         : <data6>
   },
   ...
]

Step 3: Write a jq expression against a limited data set

Now that I have my format, I need to write a jq program that takes lines of CSV as input and returns an array of formatted JSON records.

Here’s what that looks like with jq (still truncating the data file with head to make it easy to see what’s going on):

head -n3 sdss1738478.csv | \
jq --slurp --raw-input --raw-output \
    'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
        map({"id": .[0],
             "ascension": .[1],
             "declination": .[2],
             "ultraviolet": .[3],
             "green": .[4],
             "red": .[5]})'

Note I’m using the expression .[1:] to skip the first line of the data file, since that row just contains the column labels, and select(. != "") to drop the empty string that split("\n") leaves behind after the file’s trailing newline. The ability to omit lines directly from jq means less need to dirty up raw data files by “pre-processing” them before parsing.
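
If the .[1:] slicing syntax is unfamiliar, here is a minimal demonstration (using a made-up three-element array, not the real data) of how it drops the first element:

$ echo '["header row","data row 1","data row 2"]' | jq '.[1:]'
[
  "data row 1",
  "data row 2"
]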

And here is what my output looks like:

[
  {
    "id": "758882758529188302",
    "ascension": "263.4087",
    "declination": "6.278961",
    "ultraviolet": "24.5967464447021",
    "green": "23.471773147583",
    "red": "21.6918754577637"
  },
  {
    "id": "758882758529188203",
    "ascension": "263.4428",
    "declination": "6.38233",
    "ultraviolet": "26.7489776611328",
    "green": "25.1870555877686",
    "red": "21.7422962188721"
  }
]
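
Note that every value in the output above is a string, since split produces strings. If I wanted real JSON numbers, a variant of the same command could run the numeric columns through jq’s tonumber. I’d leave id as a string though: jq stores numbers as IEEE 754 doubles, and converting an 18-digit object id would silently lose precision.

head -n3 sdss1738478.csv | \
jq --slurp --raw-input --raw-output \
    'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
        map({"id": .[0],    # kept as a string: too big for a double
             "ascension": (.[1] | tonumber),
             "declination": (.[2] | tonumber),
             "ultraviolet": (.[3] | tonumber),
             "green": (.[4] | tonumber),
             "red": (.[5] | tonumber)})'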

Step 4: Convert CSV to JSON!

At this point I am done testing, so I no longer need to truncate the data file with head. But this will generate around 13 million lines of output, so I want to redirect it to a file!

jq --slurp --raw-input --raw-output \
  'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
      map({"id": .[0],
           "ascension": .[1],
           "declination": .[2],
           "ultraviolet": .[3],
           "green": .[4],
           "red": .[5]})' \
  sdss1738478.csv > sdss1738478.json

Running the expression on the whole data file takes just under 1 minute on my 2013-era MacBook Air! That’s an awesome amount of power from a very short command.
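
As a quick sanity check on the result, the number of records in the JSON array should be the line count of the CSV minus one (the header row). Something like this should confirm it:

$ wc -l < sdss1738478.csv     # lines in the raw file, including the header row
$ jq length sdss1738478.json  # records in the array: should be exactly one less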

Oct 4th, Sat

A scientific alternative to waving your hands and mumbling Dunbar’s Number

An illustration of Metcalfe’s Law

Metcalfe’s Law describes how the number of links in a communication network scales as the network grows. When visualized this way, large networks produce a characteristic graph known as “The Mystic Rose” or “Metcalfe’s Wheel” (shown at right in the illustration above). The “wheel” starts out simple but becomes more baroque and difficult to comprehend as the graph gets larger.

Specifically, what Metcalfe’s Law says is that bigger networks are more valuable because they have more connected nodes: a fully connected network of n nodes has n(n-1)/2 possible links, so the link count grows roughly as the square of the network’s size. And as a corollary: bigger networks are more complex because they have more connected nodes.

I like to think of Metcalfe’s Wheel as a visualization of the maximum number of possible frantic late-night IM conversations that can occur during an outage. The more paths of communication, the more potential for miscommunication and misunderstanding.
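
In the spirit of the jq posts above, here’s a quick one-liner sketch that tabulates that n(n-1)/2 maximum for a few team sizes. Note that at Dunbar’s famous 150, there are already more than eleven thousand possible pairwise conversations:

$ jq -cn '2, 10, 50, 150 | {people: ., possible_conversations: (. * (. - 1) / 2)}'
{"people":2,"possible_conversations":1}
{"people":10,"possible_conversations":45}
{"people":50,"possible_conversations":1225}
{"people":150,"possible_conversations":11175}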

Metcalfe’s Law provides much-needed insight into Dunbar’s Numbers

Today most people would agree that Dunbar’s Numbers are the best measure available when trying to explain the observation that overall productivity tends to drop as an organization grows larger. But a problem arises as soon as one finds out that Dunbar himself said his numbers were guesswork.

Dunbar proposed that inflection points exist and famously guessed at where they might lie. But he did not arrive at exact numbers. Despite this, exact numbers are bandied about the Web under the moniker “Dunbar’s Number,” implying that One True Number exists when as yet science has said nothing definitive at all on the topic.

Good leaders put a lot of time and love into predicting and understanding inflection points in the growth of their organizations. This is important work and I do agree that Dunbar’s Numbers are the best tool we have to work with. However it’s a very blunt tool.

Metcalfe’s Law models value and it can also model communication cost.

Modeling the communication paths in your organization as a fully connected graph is at least not guesswork, and the approach is legitimized by its use in Metcalfe’s Law. Taking this approach also exposes all of the analysis tools of graph theory, none of which are available when all you have to work with is Dunbar’s guesswork.

As illustrated by Metcalfe’s Law, fully connected graphs are a nice way to measure both the value and the complexity cost of communication within a network of people.


Flow chart showing what programmers do all day.

Sadly I’m not sure who drew it :-/

Oct 3rd, Fri

How to convert a logfile to JSONp with jq the JSON processor

jq is a new tool for command-line data mining. I’ve been using jq for a couple of months and I am very impressed.

JSONp is a data-exchange format for HTML5 applications that works around the browser’s Same-Origin Policy. It’s a very powerful resource for aggregating Web services data into HTML5 dashboards.

In this post I present a simple example of transforming plain text logfiles into JSONp with jq.

For example purposes, I am using a Web log data set from The Web Server Workload Characterization project, as I have previously discussed here.

How to convert lines of plain text into a JSONp array

Since a standard Web log file contains information broken up into lines, I just need to split the logfile on newlines (filtering out the empty string that the file’s trailing newline leaves behind) to arrive at an array, which I then wrap in a call to a predefined JavaScript function I have chosen to call jsonpHelper.

head -n5 NASA_access_log_Jul95 | \
    jq --slurp --raw-input --raw-output \
    'split("\n") | map(select(. != "")) | @text "jsonpHelper(\(.));"'

This produces JavaScript output like the following (I’ve added whitespace for readability):

jsonpHelper(["in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] \"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0\" 200 1839",
             "uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] \"GET / HTTP/1.0\" 304 0",
             "uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] \"GET /images/ksclogo-medium.gif HTTP/1.0\" 304 0",
             "uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] \"GET /images/MOSAIC-logosmall.gif HTTP/1.0\" 304 0",
             "uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] \"GET /images/USA-logosmall.gif HTTP/1.0\" 304 0"]);

Once this output has been saved to a file, I can immediately serve that file from any Web server anywhere and HTML5 applications will be able to take advantage of it!
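
For the whole file, the same expression runs without head, redirecting the output into a file. Here’s a sketch, where the output filename nasa_log.jsonp is just an arbitrary choice and Python’s built-in server stands in for “any Web server”:

jq --slurp --raw-input --raw-output \
    'split("\n") | map(select(. != "")) | @text "jsonpHelper(\(.));"' \
    NASA_access_log_Jul95 > nasa_log.jsonp

python -m SimpleHTTPServer 8000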

Sending JSONp over the wire with a few lines of JavaScript!

Just drop the following snippet of jQuery into any HTML page in order to have access to the logfile data from JavaScript.

// The logfile data will be stored here once it arrives.
var myLogFile;

// Predefined callback: the JSONp payload invokes this function by name.
var jsonpHelper = function(payload){return payload;};

$.ajax({
    url: '<URL OF YOUR JSONP FILE>',
    dataType: 'jsonp',
    jsonp: false,                 // don't append a callback=? query parameter
    jsonpCallback: 'jsonpHelper', // use the predefined function above instead
    success: function(data){myLogFile = data;}
});

I have chosen to store the data in a global variable called myLogFile so that I can work with it later.

Here is a screenshot of retrieving the generated JSONp via the Chrome Inspector (a working example).

Thanks for reading all the way to the bottom =D

If you found this information useful then you might also be interested in a more advanced example of using the jq JSON processor to work with log files and JSON data consumers!

Oct 2nd, Thu

Convert plain text logs into CSV with jq: a complex example

jq is a new tool for command-line data mining. I’ve been using jq for a couple of months and I am very impressed.

In this post I present a complex example of munging logfile data into CSV with jq.

For example purposes, I am using a Web log data set from The Web Server Workload Characterization project. The format of the log lines is specified as follows:

  1. host
  2. timestamp
  3. request
  4. response code
  5. size in bytes

Here is how I would extract the time stamp and the response code

So based on the data format described above, I can assume that my log file looks like this:

$ head -n5 NASA_access_log_Jul95

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839
uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0
uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0
uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0
uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0

And I can extract the time stamp and response code from every request, and export it all to CSV, like this (again filtering out the empty string left by the file’s trailing newline):

$ head -n5 NASA_access_log_Jul95 | \
    jq --slurp --raw-input --raw-output \
    'split("\n") | map(select(. != "")) | map(split(" ") | [(.[3] | ltrimstr("[")), .[8]]) | [["timestamp", "response code"], .[]] | .[] | @csv'

"timestamp","response code"
"01/Aug/1995:00:00:01","200"
"01/Aug/1995:00:00:07","304"
"01/Aug/1995:00:00:08","304"
"01/Aug/1995:00:00:08","304"
"01/Aug/1995:00:00:08","304"

That’s an amazing amount of power from a very short command. And in practice it doesn’t seem to matter how many lines one is dealing with, either: jq happily churns through the whole data set — 1,891,714 lines and around 196MB — in about 20 seconds on my machine.
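
Once the fields can be addressed individually like this, jq can also do simple aggregation. As a sketch of where this could go (this variation is not part of the CSV recipe above), here is a command that tallies how many requests received each response code:

jq --slurp --raw-input \
    'split("\n") | map(select(. != "") | split(" ")[8]) |
        group_by(.) | map({code: .[0], requests: length})' \
    NASA_access_log_Jul95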

Sep 24th, Wed

This is why you can’t hire software engineers in test

A Dilbert cartoon about QA

Testing is an area of software engineering that has not historically drawn the widespread respect accorded to “normal” software engineers. QA and test automation have historically been seen as less-skilled, more junior roles, perhaps even drudge work that could be replaced with sufficiently sophisticated tools.

Industry standard: A QA Manager generally gets paid the same as a Senior Software Engineer

It says something that it’s considered a perk to be hired into a position where, as a software engineer, one can work on testing tools but not be publicly identified as “in test.” Believe me, in my travels as a consultant this is a theme I have heard a lot with regard to why engineers do/don’t choose to work with various organizations.

It is demonstrably impossible to build and run a reliable system without engineers who excel at “maintenance,” also known as “keeping the site up.” Testing (aka debugging) makes up the better part of maintenance.

However, for historical reasons the software industry suffers from a persistent condition in which QA doesn’t get the respect it deserves in most organizations. As a corollary, the organizations that do value and reward testing professionals wind up recruiting all of the (already rare) competent people in the field.

And this is the state of the (Web and mobile) software industry right now: everyone who is good at being an engineer “in test” already works for a company that is willing to pay above the industry mean for testing expertise.

Escaping the stigma associated with being a tester

Once hired, these “engineers who test” have no incentive to publish, nor to be publicly seen as a voice in the testing community — remember, the industry standard is to pay testers less than other kinds of engineers.

The result is a stagnating industry where almost no one with deep technical testing knowledge is motivated to share it.

So… does your organization ignore industry standard salaries and pay testers like their expertise is a rare and valuable software engineering specialty?

Sep 18th, Thu

The JavaScript port of my poetic template engine has run into a predictable edge case ;-)

Still qualifies as a freeform haiku tho © 2014 Noah Sussman all rights reserved.


Sep 3rd, Wed

The Jenkins Wall display project

This is an MVP iteration of a CI / Testing dashboard. I have been working on this idea for several years now, but have only just gotten to a point where any of the code is general enough to be released.

http://github.com/textarcana/jenkins-wall

The screenshot above shows the Jenkins Wall displaying build statuses from the MediaWiki Jenkins instance.

Colors and other display features

White boxes correspond to jobs that are queued, red corresponds to a failed build, green to a passing build, and the grey box at bottom indicates a job that is building right now.

Clicking anywhere on a box will start the corresponding build, assuming you have permission to start builds on that Jenkins instance. I’ve also provided direct links to job configuration pages (again assumes you have permission to configure builds in the first place).

I’ve got lots more features implemented and will be extracting them on an ongoing basis from now on.

Aug 22nd, Fri

WHAT IS YOUR SHARK ATTACK BACKBONE OUTAGE MITIGATION POLICY or do you even have one…

OTOH if you tell me you game day’d this shit then I want to work for you call me

YOU: Yeah so for game days we generally flood the DC and then the CEO gets to release a shark, which can bite any cable but the system will stay up.

ME: This interview is over where is my desk.

Also sharks bite undersea fiber optic cables. It is real. It is a thing.

Aug 3rd, Sun

Generative drawings done with FlowPaper

Aug 2nd, Sat

terrysdiary:

NOTHING WORKS

That’s real.


generative drawing made with FlowPaper

Jul 30th, Wed

Punctuated Equilibrium and Intractable Systems

Punctuated equilibrium is a theory in evolutionary biology which proposes that most species exhibit little evolutionary change for most of their history. When significant evolutionary change occurs, it is generally restricted to rare and rapid (on a geologic time scale) events of branching speciation.

Wikipedia

It is a commonly accepted principle that software systems evolve, much like biological ecosystems. And in general the crossovers between programming and genetics are well-known.

A large system that works is invariably found to have grown from a small system that worked —John Gall

Erik Hollnagel’s concept of Tractable Systems and Intractable Systems assumes that in general there is an unknown but relatively discrete inflection point for a simple system, beyond which the system has become so complex that it must now be considered intractable.

Large intractable systems evolve from simple, tractable systems. A transition point between tractable and intractable exists somewhere along the timeline of an organization’s life.

Once a system flips over to being intractable, it stops mattering how “big” the system is — your concept of system complexity on some objective scale stops being relevant once system behavior has become non-deterministic.

One always has, at every stage in the process, a working system. I find that teams can grow much more complex entities in four months than they can build.

— Fred Brooks

Jul 29th, Tue

Generative drawings done in the FlowPaper iOS app.

Jul 26th, Sat

Video feedback drawing, created with iPhone, Apple TV and AirPlay.