xmj's notes

Big Data Frameworks

One of the most curious results I’ve noticed in the last two years:

Frameworks that supposedly exist to make handling large amounts of data “easier” fail horribly on average, and all in the same ways.

When used outside the narrowly defined paths of their user interface (point & click), they obscure so many relevant details that you will spend three quarters of your time trying to navigate their ill-documented APIs, when all they allow you to do is change a file on a file system, which a junior sysadmin can do in an instant.

This is, of course, fine and dandy if you work in a body-leasing sweatshop that bills you out to clients by the hour (or day) while you receive a fixed salary. Obviously you will want to learn myriad new ways of changing files in the cloud, all on a customer’s dime, so you can list experience with the latest and greatest frameworks in customer testimonials and on your resume.

Whereas, if you were self-employed, billing by terabytes processed per hour, you’d want to go the “extra mile”, and eliminate extant fragility, one source at a time, until there’s almost nothing left.

You would probably set up a highly repeatable cluster installation using very dumb tools (Perl 5.x, Bourne shell, maybe Python) on bare metal, and focus on the underlying software, never on the framework that manages it.
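To make "very dumb tools" concrete, here is a minimal sketch of the kind of building block a repeatable installation could rest on: a function that puts a file into a desired state and does nothing if it is already there. The name `ensure_file` and the example path are made up for illustration, and this is only one way to do it, not a description of any particular setup.

```python
import os
import tempfile


def ensure_file(path: str, content: str) -> bool:
    """Make `path` contain exactly `content`; return True if anything changed.

    Hypothetical helper, for illustration only.
    """
    try:
        with open(path, "r", encoding="utf-8") as fh:
            if fh.read() == content:
                return False  # already in the desired state, do nothing
    except FileNotFoundError:
        pass  # file is missing, so it has to be created

    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it over
    # the target. os.replace is atomic when both paths are on the same
    # file system, so readers see either the old file or the new one.
    # Note: mkstemp creates the file with owner-only permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(content)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # do not leave half-written temp files behind
        raise
    return True
```

Calling `ensure_file("/tmp/example.conf", "workers=4\n")` twice in a row changes the file the first time and is a no-op the second time, which is what makes a run safe to repeat on every node.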

Incentives matter.