Symmetric Labor Metrics - Vision

This is the first part of the series, laying out a vision and an MVP (minimum viable product).

Background & Context

Labor market dynamics are probably as fundamental a concern for all of us as eating, sleeping and other needs in life. Understanding the labor market has significant implications in areas such as:

  • Matching supply and demand of labor
  • Identifying the supply and demand of labor skills by geo location
  • Building a readiness index “before” venturing into a new geo location — to ensure there is capacity to scale

The roles involved in the labor market are:

  • Employer/Company (who hires labor)
  • People (the actual labor)
  • Government (governance and overall watchdog — ensuring fairness)

Problem Statement

A few questions whose answers have significant implications in the labor market:

How much salary can I ask for the job ? - Job Seeker

How much salary do I pay for the job ? - Employer 

What is a “fair” salary for a job in the labor market ? - Government

The answer to each of the above questions “depends” on too many variables that move daily in the labor market (some affected by macro-economic movements like the stock market, technology and GDP, and others by micro-economic factors like “scope”, negotiation, public vs. private sector etc.)

Like the KBB (Kelley Blue Book) index for car prices in the US (free for certain high-level data and paid for granular data) — is it possible to track labor metrics (by geo, by wage, etc.) ?

Is this an unsolvable problem ?

If we think about it: IF all the salary data across all employers in every country were aggregated in one place, the data analysis would be the easier part. Given the big data technologies we have today, it is no longer a technical constraint to get all the answers and publish a “salary index” much like a stock price or any other indicator. The index itself might be “lagging” (i.e. based on regression analysis of historical data) or “leading” (applying data science to predict the value given the values of all other variables, or a subset of them).

Trivia

In this information age, raw data is the key differentiator, giving a competitive edge to the company that owns the data. Data-intensive companies are valued higher and are sustainable for the longer term, and frankly, for-profit organizations are scrambling to collect as much data as they can to stay relevant. The “cloud revolution” with AWS, Azure and GCP is, on one hand, the next step in the digital evolution; on the other hand, these platforms become exponentially more valuable with each client that onboards to their environment and drops in their data (think of Amazon S3, Azure Blob storage etc.)

  • But as we know, NO entity (employer/company/organization) would prefer to share its data out there for analysis, especially IF that data is its “core advantage” or “competitive data”. Ultimately, “asymmetry of data” gives an arbitrage opportunity and boosts profit — otherwise also referred to as IP (intellectual property)
  • At the same time, an interesting fact is that every entity wants to know data about the “external” world, so that it can react to threats and/or take actions for survival. Keeping aside whether that is right or wrong, that is how the world works (from a philosophical angle) and we have to move on with it, i.e. information asymmetry

Data Tensions

The problem (democratized labor metrics) is solvable because we know that payroll companies (ADP, Intuit, Workday etc., for example) have the salary data. But:

  • Will they really share the data ? They lose their information advantage and/or violate fiduciary responsibility
  • Are they allowed to share ? Each client owns its employees’ salary data, the client is the data owner, and so contractual obligations do not let them share
  • Do they have structured/unstructured data that follows an industry standard ?
  • Is there an industry standard for payroll data schema ? This aspect of the problem is the easiest to solve from a technology standpoint

How do we find the sweet spot between the first two points above ?

  • Is there a website/portal/organization that can give us the data ? There are organizations like Gartner/Nielsen etc. that host data-sets they collect from various sources; however, they probably share ONLY aggregates, and granular-level access to data might require a consulting service engagement. Then there are indexes like Glassdoor, Payscale etc. that have collected data over many years using crowd-sourcing and provide a rough aggregate/estimate that serves as guidance. But do we have a canonical source of data that we can quote in our discussions and confidently use ?
  • The labor market salary information (across the world) is so disintegrated and controversial that no single organization owns it and keeps track of all the data in a way that is democratically shared and viewed by the beneficiaries (the job seeker is a role that requires this data in order to have a bargaining position)
  • There are management consulting & knowledge-building companies who in turn have huge data sets, and keep collecting more as they are engaged by clients venturing into new geographies — taking help from localized data-collection agencies. The data sets owned by such companies are probably rich, but unfortunately, yet again, they are core IP (intellectual property)

Symmetric Labor Metrics®

Symmetric Labor Metrics® is an innovative & intelligent platform that helps balance the interests and incentives of the various roles in the labor market. The platform tracks data from various sources, public and private, to give you comprehensive and well-informed labor metrics, so that you are confident:

  • In your decision making
  • In supporting decisions with a data-driven approach
  • In reconciling fairness-of-pay discussions

Data Credibility

We will start with publicly available data, e.g. OFLC performance data, joined with other relevant data like SOC (Standard Occupational Classification codes for jobs), FIPS (for geo/county info) etc. The source of the data is as credible as it can be — straight from the horse’s mouth. The data is not exhaustive, but it is a very good proxy for folks who work in the engineering field, especially high-skilled labor.

Value

Based on publicly available data, we can extract value out of the programs and derive metrics/reports that affect decision making for [Entities, People, Government] etc. Some questions that help with actions for the various roles:

Government

- Which visa program was effective ? How effective, in numbers ? Should we continue/discontinue/improve/make changes ?

- Which geos (states, counties, cities, postal codes, metro areas etc.) are doing well/not well at attracting immigrant talent ? This has tax implications, and we can correlate it with health index, crime index, business index and many more

- How did a visa program do from 2001 to date ? What macro-economic factors might explain changes in visa programs over that time trend ?

People

- In which geo should I look for jobs that pay a certain salary or above ? Which companies are visa-sponsor friendly ?

- Which companies have higher success rate for getting visa approved ?

- How can I bargain with the right data (mean, median, mode, standard deviation) for salaries and convince my future employer ? And have a fairer bargaining stance with data, instead of relying ONLY on my previous salary

Entities

- If we need highly talented engineers (for e.g.) or employees in larger numbers — which program should I analyze that incentivizes all parties involved in the right way ?

- Which Geo should I set up my next branch of office, so that I have access to highly talented employees?

- How are my competitors in the same industry doing when it comes to competing for the cream of the crop ?

- How much salary are my competitors paying for various job titles ? How can I have a fairer negotiation discussion with job seekers ?

Framework

A rough framework to visualize is as below

MVP (minimum viable product)

An MVP is ready and a framework for crowd collaboration is available in the GitHub repo. If you are an engineer/developer/hands-on coder, jump straight into the code.

Microservices – Hidden costs

Costs you need to be aware of before doing micro-services (distributed systems)

Languages

  • Hard to share code (because every micro-service might be in a totally different language)
  • Hard to move between teams (except for polyglot engineers)
  • Sometimes fragments the culture (exactly what we don’t want, yet a C# programmer might say – I don’t care about your Node.js service)

RPC

  • HTTP/REST gets complicated (which methods, which headers — and you have to communicate everything)
  • JSON needs a schema (empty string vs. null vs. … type coercion etc. It would be nice to define types)
  • RPCs are slower than local procedure calls

HOW MANY REPOS

  • Many is bad AND one is bad — find a balanced tradeoff (Uber has 7k–8k repos and they feel they are at one extreme of the spectrum – too many)

OPERATIONAL

  • What happens when things break ?
  • Can other teams release your software ?
  • Is the automation good enough ?
  • Understanding a service in the larger context (think about how easy it was to understand a monolith; focusing on understanding the full system working as one is quintessential)

PERFORMANCE

  • Performance tooling is language-dependent (chasing down a performance problem doesn’t translate across stacks so easily, which might cause friction)
  • Doesn’t matter until it does (sure, developer velocity and “deliver & don’t worry about efficiency” are trade-offs) – having some kind of minimum SLA will be very useful
  • Latency of the system >= latency of the slowest service (in micro-services, this gets amplified as you fan out) – use latency/fault-tolerance tooling like Netflix Hystrix, but are such libraries available in all language stacks ? Cross-language context propagation is needed

LOGGING

  • Need consistent, structured logging (logging adds to latency when the actual service is already being throttled). Some folks will log anything and everything, so capacity problems might occur

LOAD TESTING

  • Need to test against production
  • Without breaking metrics
  • Preferably all the time
  • All systems need to handle “test” traffic

MIGRATIONS

  • Old stuff has to work
  • What happened to immutable ?

OPEN SOURCE

  • Buy/Build tradeoff is hard
  • Commoditization
  • As long as engineers don’t feel “it’s my baby” – we are good, because sooner or later anything related to infra or platform will be commoditized by Amazon, Docker, Google, Microsoft et al.

POLITICS

  • Services might allow people to play politics
  • Detecting politics ? (Company > Team > Self) – meaning, if you violate this ordering, i.e. if you put your self-interest over the team’s, then it is politics

MicroServices Jargon

It is a common occurrence, when we start understanding a domain that is complex and multi-dimensional, to hear “terms and jargon” from people who practice (or claim to practice) it and from the ton of information available online. That is especially true if you are trying to catch up with a running train (maybe because you are doing micro-services based on hype, or because you genuinely want to know why it is such a rage these days).

There is some learning that can be organized; however, I believe that a “dictionary”, aka terms and jargon with sufficient context, is extremely crucial for communicating with other people. Below are a few that I came across in the context of micro-services.

ThoughtWorks is generally the online forerunner that explains and maintains the content and concepts around micro-services.

The book to read on micro-services – Building Microservices by Sam Newman

Component in MicroService

A component is a unit of software that is independently replaceable and upgradeable.


MicroService Envy

Some teams are rushing into adopting micro-services without understanding the changes to development, test, and operations that are required to do them well.


MicroService Premium

There is a premium to pay for micro-services. Unless the monolith system is complex enough, micro-services should NOT be the default choice.


Conway’s Law

“Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure.”


Bounded Context

Bounded Context is a central pattern in Domain-Driven Design. It is the focus of DDD’s strategic design section which is all about dealing with large models and teams. DDD deals with large models by dividing them into different Bounded Contexts and being explicit about their interrelationships.


Integration Database

An integration database is a database which acts as the data store for multiple applications, and thus integrates data across these applications. An integration database needs a schema that takes all its client applications into account. The resulting schema is either more general, more complex or both – because it has to unify what should be separate Bounded Contexts.


Eventual Consistency

Microservices are distributed systems and encourage decentralized data management (i.e. NO integration databases). However, maintaining consistency of data across distributed systems is extremely complex and we have to rely on eventual consistency of data. A fundamental concept to understand here is the CAP theorem.


Fallacies of Distributed Computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.


AWS S3 Static Web Site

Goal

To host and serve contents of a static html website using AWS S3.

Pre-Requisites

Have the content ready, with an entry point (let’s say index.html) and optionally an HTML page to display to the user for errors.

Steps

  • Copy the folder (full contents) to any bucket in S3
  • Click “Properties” for the bucket
    • Under “Permissions”, click “Add Bucket Policy” and add the below
    • {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "Allow Public Access to All Objects",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<replace_bucket_name>/*"
          }
        ]
      }
    • Depending on whether you want to expose it publicly to everyone, or to AWS-authenticated users etc., click “Add Permissions” and check the box against “List” at a minimum
  • Collapse the “Permissions” section
  • Now expand “Static Website Hosting”, check the radio button “Enable Website hosting”, then in Index Document enter index.html (or whichever file is your entry point)
  • Similarly, for Error Document enter error.html (or whichever file is your error page)
  • Save all changes (an equivalent AWS CLI sketch follows these steps)
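
For reference, here is a minimal sketch of the same setup done from the AWS CLI instead of the console. The bucket name, local folder and policy file name are placeholders, and it assumes the CLI is configured with credentials that can administer the bucket:

aws s3 sync ./site s3://<replace_bucket_name>/

# policy.json contains the bucket policy shown in the steps above
aws s3api put-bucket-policy --bucket <replace_bucket_name> --policy file://policy.json

# enable static website hosting with the index and error documents
aws s3 website s3://<replace_bucket_name>/ --index-document index.html --error-document error.html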

Conclusion

It gets as simple as that, hosting a static website on S3. Don’t forget the bucket policy JSON — that was where I spent most of my time trying to figure out why I wasn’t able to access the site.

 

Color text – folders and content in shell

Background

I wanted colors for “file and folder names” and also for “text content” once I open a file in vim or vi, like this in my terminal (either the terminal that comes with OSX or iTerm2)

needed-colors.png

colors-vi.png

I wasn’t very particular about “which” colors yet; I wanted the system to decide, however I did not want just black and white.

Simple Way

It turns out the below two things helped me (it doesn’t matter whether it is the default terminal emulator that comes with the Mac, or iTerm, or any other wrapper).

Since I use the bash shell, the below did the trick

# Set CLICOLOR if you want ANSI colors in iTerm2 or the bash shell
export CLICOLOR=1

# Set colors to match iTerm2 terminal colors or the bash shell
export TERM=xterm-256color

Also, I wanted the cursor to be more highlighted – since I use iTerm2, I did the following

Menu > Profiles > Edit (Default profile). In the “Colors” tab, select “yellow” for Cursor and Cursor Text.

cursor-colors.png

 

Directory ls colors

Alternatively, if you want the output of ls (i.e. directories) to be in a different color (e.g. turquoise on a black background), then add this to ~/.bash_profile for bash

# directory ls colors
export CLICOLOR=1 #optional
export LSCOLORS=gxBxhxDxfxhxhxhxhxcxcx

Why AWS Lambda is not “north pole” for compute yet!

Background

It is well known that the cloud is going to revolutionize how we think about information systems in general. AWS is leading the revolution and leads the public cloud movement by leaps and bounds as of today (sure, Azure and Google are trying to catch up, however they are far, far behind). Similarly, people like to talk about private cloud, hybrid cloud etc. None of those matter as much as delivering applications with speed and agility, which is increasing day by day.

If you followed AWS re:Invent 2016, one of the largest shows in the tech world this year, Amazon did an amazing job as ever with new and exciting services (improved versions and newer features).

This gist has the links to all the YouTube videos.

Compute

There are about 8 videos on Lambda. I watched a couple of them and some of the key points that stuck in my head were:

  1. Compute evolution goes as Bare Metal -> Virtual Machines -> Containers -> Lambda
  2. Serverless architecture (which doesn’t necessarily mean no servers; it means someone else is maintaining your servers) means Lambda is your best choice
  3. And then how some companies operate totally on Lambda, get charged by the millisecond, and how they kept costs low etc.

Good selling points – and sure enough there are use-cases where it makes total sense, especially when you are a bunch of developers and all you care about is writing code for an app, leaving scalability, reliability, redundancy etc. to somebody else (AWS, in the case of Lambda)

But…

Focus on Bare Metal -> Virtual Machines -> Containers -> Lambda

It is a linear dimension and that is concerning to me !

Let me throw in more context on why that doesn’t convey the full picture

  • Have you ever made a decision on choosing a technology because of a single variable operating linearly ?
  • Decision sciences research never arrives at the optimum solution by looking at one variable; if it did, life and the world would be so much easier
  • Any decision requires a context, a graph of relationships between various nodes – aka trade-off decisions, choosing one over the other based on the situation

Why do you say so ?

  • Let’s say you go “all-in”, all eggs in one basket, with Lambda as a startup and design your app (composite service-based app or whatever) using Lambda.
  • Do you know the internals of Lambda ? The “how” piece
  • Vendor lock-in is a classic strategy of almost any technology vendor
  • The more you base your app/solution on a vendor’s service as a black box, the more you are giving away your value (sure, that can always be argued as – that’s not my core, hence I will offload it)
  • But really think about it – there is always a balance between the ends of a spectrum that we need to think about (in this case the ends being bare metal vs. Lambda)

Don’t get me wrong !

  • I am NOT saying we shouldn’t go with the cloud – in fact I am a proponent of moving everything on-premises to the cloud anytime. I got my certification too, because I believe the future is going to be “mostly” cloud.
  • In fact all the positives – agility, scalability, devops problems, access gatekeeping etc. – are mostly solved by cloud solutions — so it’s a no-brainer for sure
  • I am just playing devil’s advocate here

OK, so what is actionable, in your words ?

Ah ! Now we are talking. So here is my recommendation as of today in terms of compute power.

  • Choose a container strategy (Yes Docker!) over Lambda.
  • Let the lowest abstraction layer be Docker engine.

Benefits and more balanced approach

  • Anytime you want to move your workload out of AWS, you can, because you would have maintained your Docker image manifests, docker-compose.yml or distributed application bundles (docker stack) and you OWN all of it (see the sketch after this list).
  • All you would need to do is layer the Docker engine on hosts, cluster them together, and use orchestration managers like Kubernetes, Docker Swarm etc.
  • If you are a person who “always plans for the worst” – yet innovates while taking the most risk – I would totally suggest containers over Lambda.
  • Docker containers give the best bang for the buck at this time, in my view.
  • Sure, load balancers, reverse proxies, auto scaling, DNS etc. can be chosen regardless of this decision
  • Are you okay with servers coming up in seconds ? (and not obsessing just because Lambda fires in milliseconds)
  • Are you okay spending $50 per month (assuming a t2.medium server) vs. re-writing all your code to suit Lambda requirements and then getting vendor locked-in ?
  • Leverage your cohesive team and enable them to work with each other (dev, ops, sysops, testers etc.). NO ONE PERSON can take the full load – it might sound like everyone wants to be Elon Musk – however that is not reality. (Now one might argue: if you don’t dream of the sky, how are you going to jump in the first place ? Agreed, but contrast that with “Before participating in an Olympic race, first learn to walk and sprint”.) The point being – take a balanced approach, don’t go “all-in”, and strategize so that you always have multiple options
  • All my above recommendations apply more to an enterprise than to a pure start-up
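
To make the portability point concrete, here is a minimal sketch of the kind of artifacts you would own and carry with you — the service names and image tags below are hypothetical, not from any real project:

cat > docker-compose.yml <<'EOF'
version: "2"
services:
  web:
    image: myorg/my-web-app:1.0
    ports:
      - "80:3000"
  db:
    image: postgres:9.6
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
EOF

docker-compose up -d   # the same command works on AWS, another cloud, or bare metal

Because the compose file (and the Dockerfiles behind the images) live in your own repo, moving the workload is mostly a matter of pointing this at a different set of Docker hosts.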

Finally….

  • I have actually coded in Lambda and I love it – as a developer, as an individual contributor
  • As a leader, who is responsible for delivering an application with a team of members that come with varied skill sets – I am often torn between technologies and choosing one over the other
  • Personally, I want to “do it all” by myself – yes, I have those instincts
  • As a leader – thinking as one person is not going to build empires. One has to know how to coordinate and leverage diverse skill sets, yet innovate and sail the ship – yes, a leader should do it all !
  • As with anything in the real world – it is a trade-off decision and at this point, I don’t think Lambda is the choice over containers for me

What would make it North Pole for me 🙂

  • Open-source the Lambda internals (code, set-up instructions for pulling together commodity hardware and layering the software on top of it – talk networks and storage too !)
  • Let there be competitors for Lambda across languages, stacks, vendors (call them alpha, gamma etc.)
  • As an adopter, I want to make the final decision on which “serverless” compute to use.
  • I am not a big fan of monopoly
  • OpenWhisk, Google Cloud Functions, Azure Fabric, hook.io are upcoming — they are still nowhere close to the experience we get from Lambda.
  • Give me oligopoly !

 

Update

As of 01/31/2017, there are enough options for serverless now.

Fission IO

Iron IO

Search online for more 🙂

git push – fatal: unable to access The requested URL returned error: 403

Background

If you have created a new repo in GitHub and follow the initial instructions (as below) that GitHub recommends, sometimes you might get the below error on the last command (git push):

fatal: unable to access 'https://github.com/machzqcq/service_virtualization.git/': The requested URL returned error: 403

echo "# service_virtualization" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/machzqcq/service_virtualization.git
git push -u origin master

 

Solution

git remote remove origin
git remote add origin https://machzqcq@github.com/machzqcq/service_virtualization.git
git push origin master
Password for 'https://machzqcq@github.com':
Counting objects: 14, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (13/13), done.
Writing objects: 100% (14/14), 3.63 KiB | 0 bytes/s, done.
Total 14 (delta 0), reused 0 (delta 0)
To https://machzqcq@github.com/machzqcq/service_virtualization.git
 * [new branch] master -> master
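
The fix above embeds the username in the HTTPS remote URL. An alternative sketch (assuming you have an SSH key added to your GitHub account) is to switch the remote to SSH instead:

git remote set-url origin git@github.com:machzqcq/service_virtualization.git
git push -u origin master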


Jenkins+Nexus for database backup restore

Background

Backup and restore of databases is generally an operations domain, and folks who have worked in that space know how critical BCDR (Business Continuity / Disaster Recovery) plans are – planning for the worst. Traditionally, failover to a new data center might mean backing up and restoring databases, switching DNS mappings, and bringing the application services up as quickly as possible to ensure business continuity. Databases have been a crucial piece of many applications built over the past few decades, and the focus was mainly on restoring database backups (taken at a regular interval).

Also, an enterprise team manages the BCDR process horizontally and does NOT necessarily cater to the needs and high availability of “applications”. It is generally left to the application development/operations team to figure things out once the database and network are switched over. Hence restoring configuration settings, application cache refreshes and other dependencies that the application/product requires happens in ad-hoc ways. This is still the reality in many organizations because of the gap between Dev and Ops.

With Continuous Integration/Delivery and DevOps principles (coupled with 12-factor app concepts), the ownership of applications is moving more towards development teams, aka verticals (think of it as decentralization of ownership in some ways). Tools like Docker, Jenkins, Nexus, Ansible etc. are paving the way for more independent and self-serviced capabilities in the hands of a developer. At the same time, leaps of optimization are happening as application development moves away from monoliths towards micro-services. Innovation is happening in bursts in every tool chain we can think of, towards the goal of more agility, self-service and a faster distributed-systems model.

Need

I have been a big proponent of automation, and certain tools like Nexus and Jenkins have been part of my proposition to get started quickly and realize the benefits of CI/CD. I was recently working with an application that has an Angular front end, an Express.js API layer and a Neo4j graph database backend. While it was very quick and easy to set up CI jobs in Jenkins for the UI and API layers, I started thinking about what would make the best sense for the database layer. Btw, the entire application is dockerized (docker-compose) and can be started with one command (including default data seeded).

Since the application requirements keep changing, and based on how we collect, analyse and transform the data that feeds into the application, the schema also changes with the feedback loops. Specific to Neo4j, I wanted a way to back up and restore the Neo4j DB as part of my development feedback loops.

Decision Equation

  • Backup neo4j data
  • Restore it on-demand (Clustering was not yet required)

Constraints

  • I did not want to invest in neo4j enterprise edition (yet)
  • Use neo4j community edition
  • Not rely on heavy enterprise-level centralized backup tools (yet)

Variables

  • Jenkins server
  • Nexus
  • Ability to spin up VMs on-demand

So as you can see, I wanted to make use of existing tools as much as possible, get to a point where the application can make money, and then expand the investment in licensing etc.

Solution/Workaround

So after some digging around, I narrowed it down to the following salient points:

  1. Neo4j stores its data in $NEO4J_HOME/data/ (and other files), so that makes it easy to separate out the data
  2. Neo4j stores authentication information in $NEO4J_HOME/data/dbms (I was not particular about auth because it is still the development phase). If I forget the password, I nuke the ‘dbms’ folder and Neo4j resets the default credentials to neo4j/neo4j
  3. I had to come up with a naming scheme for the backup archives every time I archive the data folder
  4. Decide where to store the archives (I wanted the backups to be accessible without restraints like enterprise firewalls etc.)
  5. While restoring, I needed the option to select which backup to restore
  6. Finally, all of the above should be automated as much as possible

 

I chose Nexus

  • The Neo4j data folder was ~4 MB, so I didn’t want huge storage systems (Amazon S3 was an option I considered, however Nexus can hold archives in its Snapshots and Releases repositories)
  • Until we decide on Neo4j enterprise edition, I wanted to have fault tolerance in place and needed a storage location
  • Nexus has been proven to be a nimble, reliable and resilient solution for dependency management needs for years

I chose Jenkins

  • Any quick automation script can be put together, and somebody has probably already done it in the Jenkins world (or there is a plugin to do it)
  • I needed a cron, and setting one up in Jenkins is a piece of cake
  • The backup job can be a simple shell script that tars the data directory and then uploads it to Nexus
  • The restore job is again a simple Jenkins job where the restore package is specified as a parameter to the job

 

So the overall flow is:

Jenkins Backup Job

Connect to the Neo4j server -> back up the Neo4j data dir (tar) -> upload to Nexus (appending the Jenkins build number to the name) -> run this job at a certain frequency.

A jenkins pipeline job looks as below

jenkins-neo4j-backup

The pipeline script looks like this

stage 'Check server is up'

node('jenkins-slave'){
 sh '''status_code=$(curl --write-out %{http_code} --silent --output /dev/null http://localhost:7474)
case "$status_code" in
 200) echo \'neo4j is running...\'
 ;;
 *) echo \'Fail!\'
 exit 1
 ;;
esac'''
}

stage 'Archive data & logs'

node('jenkins-slave') {
sh "cd $BACKUP_DIR && tar -cvzf neo4j-data-${env.BUILD_ID}.tar.gz data logs"
sh "ls -l $BACKUP_DIR"
}

stage 'Upload to Nexus'
node('jenkins-slave')
{
 sh "curl --upload-file $BACKUP_DIR/mds-data-${env.BUILD_ID}.tar.gz -u $NEXUS_USERNAME:$NEXUS_PASSWORD -v https://nexus.server.com/nexus/service/local/repositories/releases/content-NEO/NEO4j_data_${env.BUILD_NUMBER}"
}
slackSend channel: '#channel', color: '#008000', message: "Backup successful: https://nexus.server.com/nexus/service/local/repositories/releases/content-NEO/NEO4j_data_${env.BUILD_NUMBER}"

stage 'Purge archive'
node('jenkins-slave'){
 sh "rm -fr $BACKUP_DIR/neo-data-${env.BUILD_ID}.tar.gz"
}

 

Jenkins Restore Job

Get the list of all backup archives -> download the selected archive from Nexus -> connect to the Neo4j server -> clean up and restore data from the backup archive -> restart the Neo4j server

The restore job was interesting because I had to first get the list of all available Nexus packages to restore from, and then execute the restore job.

Looking it up online – Jenkins has a neat way of doing this with the “Extended Choice Parameter” plugin, where the values of the choice can be the output of a Groovy script.

So I had to figure out how to get the list of names of all the Nexus backup packages. Thankfully, Nexus already exposes a REST API to do that:

curl -X GET -u $NEXUS_USERNAME:$NEXUS_PASSWORD https://nexus.server.com/nexus/service/local/repositories/releases/content-NEO/

The above returns an XML node tree, and we can parse the tree to retrieve the package names. But I had to do it with Groovy. Here is the Groovy script that does it:

 

// Call the Nexus REST API, parse the XML response, and collect the values of
// the <text> nodes (the artifact names) into a list for the choice parameter
def response = new XmlParser().parseText(
        ["curl", "-X", "GET", "-u", "$NEXUS_USERNAME:$NEXUS_PASSWORD",
         "https://nexus.server.com/nexus/service/local/repositories/releases/content-NEO/"].execute().text
    ).data.'**'.findAll { node -> node.name() == 'text' }*.text()

The corresponding parameter in Jenkins job would look like this

extended_choice_groovy

When Build with Parameters is clicked , the groovy script gets executed and the list of all possible restoration packages shows up as follows:

restore-neo4j-build-params

The restore pipeline script is as follows:

stage 'retrieve package'
node('jenkins-slave')
{
 sh "wget --user $NEXUS_USERNAME --password $NEXUS_PASSWORD https://nexus.server.com/nexus/content/repositories/releases/-NEO/$RESTORATION_PACKAGE"
}
stage 'prepare current dir'
node('jenkins-slave')
{
 sh '''
 cd $BACKUP_DIR && 
 if [ -d data ] && [ -d logs ]; then
 tar -cvzf neo-data-current.tar.gz data logs
 fi '''
 sh "cd $BACKUP_DIR && rm -fr data logs"
}
stage 'restore'
node('jenkins-slave')
{
 sh "tar -xvzf $RESTORATION_PACKAGE -C /data/neo4j"
}
stage 'start neo4j & smoke test'
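
// The body of this last stage was not included above; a minimal sketch of what
// it could contain (assuming Neo4j community edition lives under /data/neo4j and
// serves HTTP on port 7474 — adjust the path, port and wait time to your setup):
node('jenkins-slave')
{
 // restart the server so it picks up the restored data directory
 sh "/data/neo4j/bin/neo4j restart"
 // give it a moment to come up, then smoke-test the HTTP endpoint
 sh "sleep 15 && curl --fail --silent --output /dev/null http://localhost:7474"
}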

The pipeline view is

restore-pipeline

 

Summary

After implementing the above, I am confident that whatever data comes into the system is backed up, and I can restore the data to a point in time without having to go through too many hoops.

Note: This is NOT a production solution. Yet while dealing with innovation projects – where one is inundated with so many moving parts (budget, MVP, configuration management, fewer dependencies on other teams, faster delivery etc.) – having backup/restore automation for data is ONE LESS thing to worry about and ONE MORE thing to feel confident about.

 

Docker for Beginners

ship-image

Background:

If you are a programmer/techie, chances are you have already heard about containers/micro-services and Docker. Containers have been in use for more than a decade, however they became very popular when Docker came up with a mechanism (a container format) that makes it easier to adopt container concepts. Google has been using containers for years, and companies across a plethora of industries (including financial) are adopting containers.

Containers date back to at least the year 2000 and FreeBSD Jails. Oracle Solaris also has a similar concept called Zones while companies such as Parallels, Google, and Docker have been working in such open-source projects as OpenVZ and LXC (Linux Containers) to make containers work well and securely.

Indeed, few of you know it, but most of you have been using containers for years. Google has its own open-source container technology, lmctfy (Let Me Contain That For You). Anytime you use some Google functionality — Search, Gmail, Google Docs, whatever — you’re issued a new container.

Containers vs. VMs

Both containers and virtual machines are highly portable, but in different ways. For virtual machines, the portability is between systems running the same hypervisor (usually VMware’s ESX, Microsoft’s Hyper-V, or the open source Xen or KVM). Containers don’t need a hypervisor, since they’re bound to a certain version of an operating system. But an application in a container can move wherever there’s a copy of that operating system available.

Containers and VMs are similar in their goals: to isolate an application and its dependencies into a self-contained unit that can run anywhere. Moreover, containers and VMs remove the need for physical hardware, allowing for more efficient use of computing resources, both in terms of energy consumption and cost effectiveness.

  • Containers are newer and have massive growth potential whereas VMs have been there for many years and proven to be stable
  • Containers boot in a fraction of second compared to VMs (everything else being constant from hardware perspective)
  • Containers are proven to work on massive scale (Google search)
  • Containers and micro-service architecture align perfectly
  • Security is yet to be proven with containers (well security is a never ending battle in any context for that matter)

 

From an architectural approach, some of the differences are:

Virtual Machines

A VM is essentially an emulation of a real computer that executes programs like a real computer. VMs run on top of a physical machine using a “hypervisor”. A hypervisor, in turn, runs on either a host machine or on “bare metal”.

Let’s unpack the jargon:

A hypervisor is a piece of software, firmware, or hardware that VMs run on top of. The hypervisors themselves run on physical computers, referred to as the “host machine”. The host machine provides the VMs with resources, including RAM and CPU. These resources are divided between VMs and can be distributed as you see fit. So if one VM is running a more resource-heavy application, you might allocate more resources to that one than to the other VMs running on the same host machine.

The VM that is running on the host machine (again, using a hypervisor) is also often called a “guest machine.” This guest machine contains both the application and whatever it needs to run that application (e.g. system binaries and libraries). It also carries an entire virtualized hardware stack of its own, including virtualized network adapters, storage, and CPU — which means it also has its own full-fledged guest operating system. From the inside, the guest machine behaves as its own unit with its own dedicated resources. From the outside, we know that it’s a VM — sharing resources provided by the host machine.

As mentioned above, a guest machine can run on either a hosted hypervisor or a bare-metal hypervisor. There are some important differences between them.

First off, a hosted virtualization hypervisor runs on the operating system of the host machine. For example, a computer running OSX can have a VM (e.g. VirtualBox or VMware Workstation 8) installed on top of that OS. The VM doesn’t have direct access to hardware, so it has to go through the host operating system (in our case, the Mac’s OSX).

The benefit of a hosted hypervisor is that the underlying hardware is less important. The host’s operating system is responsible for the hardware drivers instead of the hypervisor itself, and is therefore considered to have more “hardware compatibility.” On the other hand, this additional layer in between the hardware and the hypervisor creates more resource overhead, which lowers the performance of the VM.

A bare-metal hypervisor environment tackles the performance issue by installing on and running from the host machine’s hardware. Because it interfaces directly with the underlying hardware, it doesn’t need a host operating system to run on. In this case, the first thing installed on a host machine’s server as the operating system will be the hypervisor. Unlike the hosted hypervisor, a bare-metal hypervisor has its own device drivers and interacts with each component directly for any I/O, processing, or OS-specific tasks. This results in better performance, scalability, and stability. The tradeoff here is that hardware compatibility is limited, because the hypervisor can only have so many device drivers built into it.

After all this talk about hypervisors, you might be wondering why we need this additional “hypervisor” layer in between the VM and the host machine at all.

Well, since the VM has a virtual operating system of its own, the hypervisor plays an essential role in providing the VMs with a platform to manage and execute this guest operating system. It allows for host computers to share their resources amongst the virtual machines that are running as guests on top of them.

[image: VMs vs. Docker containers — architecture diagram]

As you can see in the diagram, VMs package up the virtual hardware, a kernel (i.e. OS) and user space for each new VM.

Container

Unlike a VM which provides hardware virtualization, a container provides operating-system-level virtualization by abstracting the “user space”. You’ll see what I mean as we unpack the term container.

For all intents and purposes, containers look like a VM. For example, they have private space for processing, can execute commands as root, have a private network interface and IP address, allow custom routes and iptables rules, can mount file systems, etc.

The one big difference between containers and VMs is that containers *share* the host system’s kernel with other containers.

[image: container architecture diagram]

This diagram shows you that containers package up just the user space, and not the kernel or virtual hardware like a VM does. Each container gets its own isolated user space to allow multiple containers to run on a single host machine. We can see that all the operating system level architecture is being shared across containers. The only parts that are created from scratch are the bins and libs. This is what makes containers so lightweight.
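
A quick way to see that kernel sharing for yourself — a hedged sketch, assuming Docker is installed and the alpine image can be pulled from Docker Hub:

uname -r                          # kernel version on the host
docker run --rm alpine uname -r   # the container reports the same host kernel

The container gets its own user space (alpine’s binaries and libraries) but runs on the host’s kernel, which is the diagram above in two commands.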

What is Docker?

Docker brings several new things to the table that the earlier technologies didn’t. The first is that it has made containers easier and safer to deploy and use than previous approaches. In addition, because Docker partnered with the other container powers, including Canonical, Google, Red Hat, and Parallels, on its key open-source component libcontainer, it brought much-needed standardization to containers. At the same time, developers can use Docker to pack, ship, and run any application as a lightweight, portable, self-sufficient container that can run virtually anywhere.

Some reasons why Docker has gained popularity over other container based technologies.

1. Ease of use: Docker has made it much easier for anyone — developers, systems admins, architects and others — to take advantage of containers in order to quickly build and test portable applications. It allows anyone to package an application on their laptop, which in turn can run unmodified on any public cloud, private cloud, or even bare metal. The expectation is: “build once, run anywhere.”

 

2. Speed: Docker containers are very lightweight and fast. Since containers are just sandboxed environments running on the kernel, they take up fewer resources. You can create and run a Docker container in seconds, compared to VMs which might take longer because they have to boot up a full virtual operating system every time.

 

3. Docker Hub/Image Registry: Docker users also benefit from the increasingly rich ecosystem of Docker Hub, which you can think of as an “app store for Docker images.” Docker Hub has tens of thousands of public images created by the community that are readily available for use. It’s incredibly easy to search for images that meet your needs, ready to pull down and use with little-to-no modification. Sonatype Nexus (Nexus User Guide) also now supports the Docker container registry type.

 

4. Modularity and Scalability: Docker makes it easy to break out your application’s functionality into individual containers. For example, you might have your Postgres database running in one container and your Redis server in another while your Node.js app is in another. With Docker, it’s become easier to link these containers together to create your application, making it easy to scale or update components independently in the future.

 

Docker Concepts

If you want to deep dive into the technology right away and you have the background, then the Docker Containers section would give you the most benefit (scroll down past the table).

If you are really new, then follow along.

linux-vs-windows-suppport

** Docker has strong roots in *nix. Windows started aligning with Docker very recently by opening up the kernel with Windows Server 2016.

Docker Engine

The Docker engine is the layer on which Docker runs. It’s a lightweight runtime and tooling that manages containers, images, builds, and more. It runs natively on Linux systems and is made up of:

1. A Docker Daemon that runs in the host computer.
2. A Docker Client that then communicates with the Docker Daemon to execute commands.
3. A REST API for interacting with the Docker Daemon remotely.

Docker Client

The Docker Client is what you, as the end-user of Docker, communicate with. Think of it as the UI for Docker. For example, when you run…

docker build -t trusty/ubuntu .

…you are communicating with the Docker Client, which then communicates your instructions to the Docker Daemon. The Docker Daemon is what actually executes the commands — like building, running, and distributing your containers. The Docker Daemon runs on the host machine, but as a user, you never communicate directly with the Daemon. The Docker Client can run on the host machine as well, but it’s not required to; it can run on a different machine and communicate with the Docker Daemon that’s running on the host machine.
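
A small sketch to make that client/daemon split visible — the remote host name and port below are placeholders and assume a daemon configured to listen on TCP:

docker version                                # prints separate Client and Server (daemon) sections
docker -H tcp://some-docker-host:2375 info    # the same local client, pointed at a remote daemon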

Dockerfile

A Dockerfile is where you write the instructions to build a Docker image. These instructions can be:

  • RUN apt-get -y install some-package: to install a software package
  • EXPOSE 8000: to expose a port
  • ENV ANT_HOME /usr/local/apache-ant: to pass an environment variable

and so forth. Once you’ve got your Dockerfile set up, you can use the docker build command to build an image from it. Here’s an example of a Dockerfile:
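
The sketch below is a hypothetical example (the base image, package and command are arbitrary choices, not from the original post), written as a shell heredoc so it can be pasted straight into a terminal and built:

cat > Dockerfile <<'EOF'
FROM ubuntu:16.04
RUN apt-get update && apt-get -y install curl
ENV ANT_HOME /usr/local/apache-ant
EXPOSE 8000
CMD ["bash"]
EOF

docker build -t my-sample-image .   # builds an image named my-sample-image from the Dockerfile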

Docker Image

Images are read-only templates that you build from a set of instructions written in your Dockerfile. Images define both what you want your packaged application and its dependencies to look like *and* what processes to run when it’s launched.

The Docker image is built using a Dockerfile. Each instruction in the Dockerfile adds a new “layer” to the image, with layers representing a portion of the image’s file system that either adds to or replaces the layer below it. Layers are key to Docker’s lightweight yet powerful structure. Docker uses a Union File System to achieve this:

Union File Systems

Docker uses Union File Systems to build up an image. You can think of a Union File System as a stackable file system, meaning files and directories of separate file systems (known as branches) can be transparently overlaid to form a single file system.

The contents of directories which have the same path within the overlaid branches are seen as a single merged directory, which avoids the need to create separate copies of each layer. Instead, they can all be given pointers to the same resource; when certain layers need to be modified, it creates a copy and modifies that local copy, leaving the original unchanged. That’s how file systems can *appear* writable without actually allowing writes. (In other words, a “copy-on-write” system.)

Layered systems offer two main benefits:

1. Duplication-free: layers help avoid duplicating a complete set of files every time you use an image to create and run a new container, making instantiation of docker containers very fast and cheap.
2. Layer segregation: Making a change is much faster — when you change an image, Docker only propagates the updates to the layer that was changed.
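
You can inspect those layers yourself — a small sketch, using whichever image you happen to have locally (ubuntu:16.04 here is just an example):

docker pull ubuntu:16.04
docker history ubuntu:16.04   # one row per layer, showing the instruction that created it and its size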

Volumes

Volumes are the “data” part of a container, initialized when a container is created. Volumes allow you to persist and share a container’s data. Data volumes are separate from the default Union File System and exist as normal directories and files on the host filesystem. So, even if you destroy, update, or rebuild your container, the data volumes will remain untouched. When you want to update a volume, you make changes to it directly. (As an added bonus, data volumes can be shared and reused among multiple containers, which is pretty neat.)
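
A hedged example of a named volume surviving its container — the volume, container and image names are arbitrary:

docker volume create mydata
docker run -d --name db -v mydata:/var/lib/postgresql/data postgres:9.6
docker rm -f db                                   # destroy the container...
docker run --rm -v mydata:/data alpine ls /data   # ...the volume and its data are still there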

Docker Containers

A Docker container, as discussed above, wraps an application’s software into an invisible box with everything the application needs to run. That includes the operating system, application code, runtime, system tools, system libraries, etc. Docker containers are built off Docker images. Since images are read-only, Docker adds a read-write file system over the read-only file system of the image to create a container.

layers

Moreover, when creating the container, Docker creates a network interface so that the container can talk to the local host, attaches an available IP address to the container, and executes the process that you specified to run your application when defining the image.

Once you’ve successfully created a container, you can then run it in any environment without having to make changes.
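
That read-write layer on top of the image can be observed directly — a sketch with arbitrary container and file names:

docker run --name demo ubuntu:16.04 touch /hello.txt   # write a file inside a fresh container
docker diff demo                                       # shows /hello.txt added in the container's read-write layer
docker rm demo                                         # removing the container discards that layer; the image is untouched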

Potential Docker EcoSystem

Docker_ecosystem

 

References

http://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/

http://www.informationweek.com/strategic-cio/it-strategy/containers-explained-9-essentials-you-need-to-know/a/d-id/1318961

https://www.linkedin.com/pulse/beginner-friendly-intro-containers-vm-docker-preethi-kasireddy

https://azure.microsoft.com/en-us/blog/containers-docker-windows-and-trends/

 

Ansible Tower

Reading time: 3 min

Background

It is assumed that you have working knowledge of Ansible (control server, remote nodes, inventory, modules, ansible.cfg etc.). After working with Ansible for a while and executing playbooks against inventories (target hosts), one starts to feel the need for a web UI that takes away the chore. I use the word ‘chore’ only for after one has acquired the domain knowledge of Ansible internals — and that is possible only by using Ansible from the command line for quite some time and understanding the connecting pieces and workflows. That said, in an enterprise type of setting, RBAC (role-based access control) around playbooks, inventories, credentials to access the inventories, teams and groups will be needed. Hence Ansible Tower!

What is Ansible Tower

“Centralize and control your Ansible infrastructure with a visual dashboard, role-based access control, job scheduling, and graphical inventory management. Tower’s REST API and CLI make it easy to embed Tower into existing tools and processes.” — straight from the horse’s mouth.

Product Customizations

There are different tower editions and prices for each edition.

The license file contains more details. In this blog, we will go ahead and use the Vagrant Ansible Tower image with the free Tower trial. Follow the instructions on this page: give your details and Red Hat will send you a license file (a .txt file). In my case, the license allowed usage of a max. of 10 nodes, i.e. I can kick off Ansible playbooks against an inventory of hosts that contains up to 10 nodes.

Vagrant

Anyway, after you follow the instructions to get the Vagrant image, your Vagrantfile should look something like the below.

# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
 # The most common configuration options are documented and commented below.
 # For a complete reference, please see the online documentation at
 # https://docs.vagrantup.com.

 # Every Vagrant development environment requires a box. You can search for
 # boxes at https://atlas.hashicorp.com/search.
 config.vm.box = "ansible/tower"

There will be more commented-out configuration, but the main line is config.vm.box.

Place this Vagrantfile in any directory and start the VM using the ‘vagrant up’ command, as sketched below. Of course, it is assumed that you have already installed Vagrant on your OS.
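
A minimal sketch of those commands (assuming Vagrant and a provider like VirtualBox are already installed):

mkdir ansible-tower && cd ansible-tower
vagrant init ansible/tower   # writes a Vagrantfile like the one above
vagrant up                   # downloads the box and boots the VM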

Access Tower

To get the Ansible Tower web URL and username/password, ssh into your Vagrant Ansible box as below. If a private key passphrase is asked for, just hit enter.

 

F:\ansible\ansible-tower>vagrant ssh
Enter passphrase for key 'F:/ansible/ansible-tower/.vagrant/machines/default/virtualbox/private_key':
vagrant@127.0.0.1's password:
Last login: Tue Jun 14 01:30:48 2016 from gateway

 Welcome to Ansible Tower!

 Log into the web interface here:

 https://10.42.0.42/

 Username: admin
 Password: <your_password>

 The documentation for Ansible Tower is available here:

 http://www.ansible.com/tower/

 For help, email support@ansible.com

Now hit the URL that shows up on your screen and enter the credentials. The first time, you will have to import the license; in my case, it looks as below.

ansible_license

 

Once you have imported the license successfully, follow the below steps to verify your Tower is good to go.

Test it Out!

Click Settings and set up each of the below

Organizations

ansible_organizations

Users

ansible_users

Credentials

root is the user that will execute the playbook on the target hosts. The reason we chose root is because the sample playbook we will run at the end requires root (it installs the NTP server and starts it).

ansible_credentails

 

Inventories

Set up a sample_prod inventory

ansible_inventories

Add hosts to inventory

ansible_inventory_hosts

ansible_hosts_inventory_db1

In my case, I brought up three Vagrant machines using the Vagrantfile at this location. So I had three machines with the IPs in the Vagrantfile, and I used the db and web IPs in the above hosts values.

Projects

Now set up a sample project and point it to the Git repo – https://github.com/ansible/ansible-examples/

ansible_projects

Job Templates

Fill in the values as below

ansible_job_template

Instantiate a Job

Click the launch icon against the job template, so that it kicks off the playbook against the db1 and web1 machines.

ansible_job_template_run

 

Job Runs

ansible_jobs_runs

Console Output

ansible_job_stdout

ansible_job_run

Conclusion

That is it! Importing a playbook from online, creating a job template and running the job against an inventory we already defined. That’s how easy Ansible Tower can get.

There are intricacies and more details involved in applying Ansible Tower to your already-set-up playbooks and identifying exactly which needs Ansible Tower might help solve.

However this post is to get you started quickly.

 

AWS Sandbox ?

Hit this page and pick the AMI based on your AWS region (e.g. in us-east-1, ami-a013f9cd).

ansible_ami

  • Choose t1.micro if you want to experiment with the free tier. Go with the defaults, review and launch. In the security group, be sure to allow inbound 443 and 80. You can access the Ansible Tower web UI over either HTTP (80) or HTTPS (443). (An AWS CLI sketch of this launch follows this list.)
  • ssh into the machine with your key pair and notice that apache2 has virtual hosts created on 443 and 80 – these are the inbound ports on which Tower serves all its web UI content.
  • You might have to ssh once into the box to get the admin username and credential (you will see a screen similar to the one for the Vagrant machine above)
  • Hit the URL https://<public ip/dns> or http://<public ip/dns> (this URL is also displayed in the message of the day (motd) when ssh’d into the machine)
  • The rest of the initial steps of importing the license, and the subsequent steps, are the same as above. Of course, you have to have target hosts that can be reached by this Ansible Tower server in order to execute playbooks (above we used Vagrant VMs to set up an experimental lab – you can spin up more EC2 instances to experiment)
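
For reference, a hedged sketch of launching the same sandbox from the AWS CLI — the key pair and security group below are placeholders you must replace with your own:

aws ec2 run-instances \
  --image-id ami-a013f9cd \
  --instance-type t1.micro \
  --key-name <your-key-pair> \
  --security-group-ids <sg-allowing-22-80-443> \
  --count 1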

 

Feedback: pradeep@seleniumframework.com