Viasat Tech Blog

CI/CD Pipeline for Maintaining a Stable and Customizable Kubernetes
Mon, 06 Aug 2018 17:30:47 +0000

I am an intern in the Global Infrastructure group in the Austin, TX office this summer. Our group is responsible for cloud and infrastructure engineering projects that allow Viasat's development teams to move fast and deploy software efficiently. Containers and container technologies are largely responsible for that, and they have exploded in popularity over the last few years.

Containers are a form of packaging in which applications are abstracted from the environment they run in. But when you run a large number of containers in production, you have to deal with the underlying complexity of maintaining individual machines, managing uptime, and moving resources around. That is where Kubernetes comes in.

What is Kubernetes?

Kubernetes is an open-source, production-scale container orchestration tool that automates the deployment, scaling, and operation of application containers across clusters of hosts. It can run any containerized application, and it has become the industry standard for deploying containers into production.

So what is the problem then? It sounds like Kubernetes is the perfect solution!

Well, as great as Kubernetes is, using it in certain environments brings unique concerns. Working with open-source projects is nice because they do not cost money, but when there are bugs, there is no one you can call, complain to, or pay to fix them. Kubernetes is released about once a quarter, with maintenance releases every couple of weeks. Viasat cannot wait that long for patches when we are running Kubernetes on important workloads. Additionally, Viasat may want to extend the project without losing the benefit of constant improvements from the open-source contributors. Kubernetes in its stock form is therefore not always sufficient for our needs, so it is valuable for Viasat to have a stable, continuously updated copy of the orchestration tool.

The Art of CI/CD

Problems like these are why the practice of continuous integration/continuous deployment (or CI/CD) has become so popular. CI/CD pipelines increase an organization’s ability to deliver applications and services at a higher velocity by integrating code changes more frequently and reliably. This allows the organization to compete more effectively in the market.

By creating a CI/CD pipeline for maintaining a stable version of Kubernetes, we would be able to give Viasat a competitive advantage by removing the overhead that comes with confidently deploying Kubernetes into real-world environments.

Architecture of the Pipeline


  • We maintain Viasat's version of Kubernetes on GitHub. We will refer to this version as Viasat Hyperkube because it is Viasat's copy, and its image is pushed automatically on each update, as we will illustrate.
  • Jenkins is an open-source automation tool often used for CI/CD; its Pipeline feature can be leveraged to automate complicated tasks.
  • Docker is an open-source software containerization tool. Images are snapshots of containers that are not yet running, and a Docker registry is a repository that stores your Docker images.

Our cloning pipeline is set to run first. This pipeline fetches changes from the upstream Kubernetes repository and attempts to merge them into the Viasat Hyperkube repository. If a merge conflict arises, it is reported to the administrator along with a series of steps to resolve it. We do not want to automate this aspect of the process because a Viasat employee needs to decide which changes should persist and which should be discarded. This is a multi-branch pipeline, so the above process runs for each configured branch.
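At its core, the clone stage is a short sequence of git operations per branch. Here is a sketch of that sequence; the repository URLs and remote names are illustrative placeholders, not Viasat's actual setup:

```python
def clone_stage_commands(branch, upstream="https://github.com/kubernetes/kubernetes.git"):
    """Return the git commands the clone pipeline would run for one branch.

    The Viasat Hyperkube URL below is a placeholder for illustration.
    """
    return [
        f"git clone --branch {branch} git@github.example.com:viasat/hyperkube.git",
        f"git remote add upstream {upstream}",
        f"git fetch upstream {branch}",
        # If this merge hits a conflict, the pipeline stops and notifies the
        # administrator instead of auto-resolving: a human decides which
        # changes persist and which are discarded.
        f"git merge upstream/{branch}",
        f"git push origin {branch}",
    ]
```

In the multi-branch pipeline, Jenkins runs this same sequence once per configured branch.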

Our build pipeline runs after the clone pipeline terminates. This pipeline compares the most recent commit ID on the remote Viasat Hyperkube repository with the most recent processed commit ID stored locally on one of our Jenkins slaves (an AWS EC2 instance). If the commit IDs differ, the remaining steps in the pipeline are triggered; otherwise the build is aborted, because there is no reason to rebuild a copy we already have an image for. From there, the pipeline clones the latest version of Viasat Hyperkube locally and builds the binary (by binary, we simply mean a form of Kubernetes that is ready to use). The binary is then packaged as a Docker image and given a tag.
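The trigger check at the start of the build pipeline is simple to express. In the sketch below, the image tag format is an assumption for illustration, not our exact naming convention:

```python
def should_build(remote_head, last_built):
    """Build only if the remote Viasat Hyperkube head commit differs from
    the last commit we already produced an image for."""
    return remote_head != last_built

def image_tag(branch, commit):
    """Tag the built image with its branch and short commit ID (the naming
    scheme here is illustrative)."""
    return f"viasat-hyperkube:{branch}-{commit[:8]}"
```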

At this point, the build pipeline triggers the test pipeline to run end-to-end conformance tests on the image. The results of these tests tell us whether the image is stable enough to confidently deploy into production. Control is then given back to the build pipeline. If the image passed all of the end-to-end tests, it is pushed to our Docker registry, which maintains a handful of the most recent Viasat Hyperkube images that passed all of the tests. In the event that any tests fail, the Test Results Analyzer plugin for Jenkins can be used to diagnose the failures. We also keep several of the most recent test logs, including error messages, in AWS S3 to assist in resolving test failures. The administrator receives a notification from Jenkins stating whether the image passed all the tests and whether it was pushed to the registry (including a URL), with the test log attached.

Day-to-Day Usage

With our pipeline, involvement from any Viasat employee is largely removed (apart from manually resolving merge conflicts). Options such as which branches of Kubernetes to clone and how many stable images to store in our registry are parameterized and can be modified at any time. A Viasat employee can come into the office in the morning and automatically have a stable version of Viasat's Kubernetes that is ready to deploy into production!


As demonstrated by this project, we learned how to use a handful of cutting-edge technologies as part of our summer internship here at Viasat. Not only are Docker, Jenkins, and Kubernetes highly sought-after skills, but their applications are nearly endless.

But we also learned a lot about the type of company we want to work for post-graduation. We really appreciate the level of autonomy granted to us, ranging from managing our own working hours to making our own technical decisions within our project. Each time our supervisor Piotr Siwczak made a suggestion to our team about how to tackle a particular problem, he would immediately follow it up with "But if you think that there is a better way you should definitely try that." I think that this idea of blazing your own trail within the company is a philosophy well-fostered at Viasat. And I would encourage you to come experience it for yourself.

So if you are looking to intern somewhere where you can grow professionally, fly to California for a hackathon and social events with all expenses paid, present your impactful work to executives, pick up cutting-edge tech skills… I could go on.

Check out: Viasat Internships

Note: Austin worked in the Austin, TX office alongside Bhavani Balasubramanyam and Mittal Jethwa. Bhavani is a second-year Masters student at Arizona State University, Mittal is a second-year Masters student at San Diego State University, and Austin is a junior at Rice University.


Launching a Virtual Ground System Network – The Bridge to Being an Internet "Experience" Provider
Tue, 17 Apr 2018 21:10:15 +0000

Like most developers, we think what we are developing is the most important part of our system. Our infrastructure service is the center of the universe; everything else revolves around it. In reality, Viasat's brand-new satellite broadband service is the main thing, and the virtual network is built to support it. But the virtual network is essential to the whole customer experience and a pathway for Viasat to create a planet-wide broadband network.

In today's world connectivity has become essential, but traditional service provider networks are very slow to change. The people using them learn to adapt to the shortcomings; that is human nature. It's the same as people demanding "faster horses" instead of cars from Henry Ford.

Viasat is different, we design and operate our own technology. That allows us to innovate and build products that benefit Viasat and allows us to transfer the benefits to our customers.


Our new ground segment network is the world's first production end-to-end virtual network, run entirely using DevOps methodology. Most other service providers, and even people within our organization, thought we were crazy to embark on that journey.


Seven Steps to a Virtual Service Provider Network

Here are the 7 steps to create a ground system for a satellite that is completely virtual in our private cloud.

Step 1: Setting up the hardware

This step isn’t really key to this article, so I’ll just summarize to say that operating our private cloud includes deploying and managing all the equipment we need in our data centers.  This step ensures we have all the necessary equipment installed and ready for provisioning.

Step 2: Powering up and base provisioning

We use ZTP (zero touch provisioning) for all of the hardware in the data center. Everything from the switches to the compute nodes boots and pulls the right software (OS) and base configuration from the provisioning servers.

Step 3: Assign device personalities and provision

After ZTP finishes, the auto-provisioning process starts, and it continues for the lifetime of the hardware. The auto-provisioning agent's job is to make sure that the configuration on the underlying hardware matches the model.
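Conceptually, the agent runs a reconcile loop: diff the model against the device's actual configuration and apply only what differs. A toy sketch, using flat key/value dicts in place of real vendor driver calls:

```python
def reconcile(model, actual):
    """Compute the configuration changes needed to make `actual` match `model`.

    A stand-in for the auto-provisioning agent's core logic: both arguments
    are flat key/value dicts of device settings (real devices are driven
    through vendor-specific drivers, not dicts).
    """
    changes = {}
    for key, desired in model.items():
        if actual.get(key) != desired:
            changes[key] = desired          # set missing or drifted values
    for key in actual.keys() - model.keys():
        changes[key] = None                 # None marks settings to remove
    return changes
```

Running this loop continuously is what keeps the hardware converged on the model for its whole lifetime.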


We use Ansible, plus vendor drivers we developed ourselves, to abstract the vendor-specific methods from the actual model.

At this point all of the compute nodes are set up for the IaaS layer install. The routers are connected to the backbone, and the switches are set up for all of the data center connectivity.

Step 4: Setup IaaS layer

The next step is to lay down the IaaS layer – we have our own, called ViaStack. It is a custom IaaS layer that supports the service provider use case of a "transient data center," which I discussed in an earlier blog post.



Step 5: Setup Virtual networking and dynamic service chaining

This is the final stage of the infrastructure. Here is where we create the layer that makes virtual networking possible; some of it is covered in an earlier blog post.

After all of the software is installed, we create layers of the network (vnets, or virtual networks) that connect the pieces together.

Step 6: Deploy applications

At this stage, all of our internal as well as vendor apps are ready to be deployed. These range from virtual routers and traffic conditioners to actual satellite MAC layers.

The applications all do blue/green deployments. This approach allows a new virtual network application cluster to be created while the old one is still in service. Then we can bring the new cluster into service in stages. This way we cause minimal disruption to our customers.

We also use the concept of network templates; the same templates are used in all stages of the promotion pipeline.

Step 7: Activate the network – add subscribers

Now the network is ready to run. We can add millions of subscribers to this ground segment. Monitoring is still something I want to cover. It’s a big topic on its own.


Why are virtual networks better for the customer experience, you ask?

Virtual networks allow insertion of new network functions dynamically: it's like loading apps on your phone to personalize it. Network functions are like apps: different people need different sets, and of course, they always want the latest ones.

Allowing new network functions to be dynamically inserted is huge. It allows us to create apps for the network. For example, an app that can optimize the video experience and give our customers virtually unlimited video viewing capability. In the future we will have more apps that allow customers to take a more active role in their service.
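The "apps for the network" idea maps naturally onto service chaining: traffic passes through an ordered list of network functions, and a new function can be inserted without touching the others. A minimal sketch:

```python
def run_chain(chain, packet):
    """Pass a packet through an ordered chain of network functions."""
    for fn in chain:
        packet = fn(packet)
    return packet

def insert_function(chain, fn, position):
    """Insert a new network function into the chain without disturbing the
    existing ones -- the 'load an app onto the network' operation."""
    return chain[:position] + [fn] + chain[position:]
```

Because insertion builds a new chain rather than mutating the old one, the old chain can keep serving traffic until the new one is cut over.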

Network virtualization also allows for new levels of stability not possible earlier. We can do smaller upgrades with quick rollbacks if things go wrong. We can do hitless updates during the day vs taking the service down in the middle of the night when our customers are up late binge watching their favorite show.

Having a virtual network means that we can connect devices around the world in a very dynamic fashion. A passenger on an airplane going from the US to Europe can not only get fast internet service the whole way, but can keep it even as the plane crosses satellite boundaries.


Lift: A General Purpose Architecture for Scalable, Realtime Machine Learning
Thu, 25 Jan 2018 00:26:24 +0000

In the last few years, we've witnessed explosive growth in the role machine learning (ML) plays in technology. Making good predictions from data has always been important in our industry, but modern machine learning techniques allow us to be much more systematic. However, this wealth of new ML algorithms and services presents new challenges for software developers.

For example:

  • Should we use managed ML services from Cloud providers or build our own?
  • How can we architect our infrastructure to have the most flexibility while we refine our ML system?
  • Do we train our ML algorithms in batch or perform streaming updates?
  • How fast can we make predictions?
  • How do we ensure our system scales as we add more users?

The Acceleration Research and Technology (ART) group faced these questions, and many others, in the course of building an ML service that predicts the structure of the internet. In this post, I’ll give a brief overview of Lift, an architecture we developed at ViaSat for building scalable ML services, and techniques we’ve found useful for managing those services.

Our Design Goals

The space of ML technologies is very diverse, with many different models of computation and performance guarantees. Before we started developing Lift, it was important to specify what we wanted to accomplish:

  • Realtime: When fully deployed, our system will ingest thousands of records every minute. We want this new data to be immediately incorporated into our mathematical models.
  • Scalable and Highly Available: We want our service to be arbitrarily scalable, capable of taking requests from millions of users at the very least.
  • Fast Predictions: In many applications, predictions are only useful if we can reply quickly (<10 milliseconds from receiving a request).
  • Flexible: We’re always improving our algorithms so we want an architecture that lets us get our best discoveries into production quickly.
  • Reusable: Many groups need tools to continuously extract predictions from vast flows of data (for example, detecting and eliminating malware that might be sent over our network). We want to save labor by building a platform where developers can focus on solving their core ML problem and leave the scale/service/infrastructure issues to us.

To meet these goals, we’ve developed Lift, an architecture for scalable, realtime ML services, and a collection of tools and techniques for creating and managing those services.

The Lift Architecture

Lift Architecture

Lift decomposes the machine learning process into three phases, Enrich, Train, and Infer, each supported by its own cluster of servers. These services have been implemented on AWS (Amazon Web Services) and leverage AWS’s managed services to achieve scale.

The Enrich cluster receives feedback records containing the raw data we'll process in order to make predictions. Enrich is responsible for sanitizing feedback records, recording instrumentation for our real-time monitoring systems, and transforming the records in preparation for storage in S3. In this way, Enrich uses S3 to create a clean "history" of the data our system processes. This history is invaluable for analysts who need to backtest new algorithm ideas on historical data.
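As a sketch, an Enrich worker's per-record logic looks something like this; the field names and schema here are illustrative assumptions, not our actual record format:

```python
def enrich(record, required=("timestamp", "source", "payload")):
    """Sanitize one feedback record and shape it for the S3 history.

    Returns None for records that fail validation; in the real system those
    are dropped and counted by the monitoring instrumentation.
    """
    if not all(record.get(k) is not None for k in required):
        return None
    clean = {k: record[k] for k in required}
    # Derive a partition key so the S3 "history" is laid out by ingest day.
    clean["ingest_day"] = record["timestamp"][:10]
    return clean
```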

The Train cluster is responsible for distilling Enrich’s output into models. A model is a serializable data structure that aggregates the important statistics we’ve gathered about a particular topic or domain. When new feedback records arrive from Enrich, Train pulls the relevant model from our Elastic File System (EFS), updates the statistics with the feedback’s information, and writes it back. This streaming approach means updates to our models are very fast.
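A minimal stand-in for such a streaming model: per-key statistics updated incrementally as each feedback record arrives, with no batch recompute. The statistics kept here (a running count and mean) are illustrative; the real models aggregate domain-specific statistics.

```python
class Model:
    """A toy streaming model: per-key running count and mean, updated one
    feedback record at a time."""

    def __init__(self):
        self.stats = {}   # key -> (count, mean)

    def update(self, key, value):
        count, mean = self.stats.get(key, (0, 0.0))
        count += 1
        mean += (value - mean) / count     # incremental mean, no batch pass
        self.stats[key] = (count, mean)

    def predict(self, key, default=0.0):
        return self.stats.get(key, (0, default))[1]
```

Because each update touches only one key's statistics, the read-update-write cycle against EFS stays small and fast.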

The Infer cluster is responsible for making our predictions available to the outside world. It receives prediction requests, which record the details of the type of prediction the requester wants. In our application, we have to reply quickly to prediction requests to provide any benefit. That's why models and EFS's fast read times are so important: they allow Infer to quickly select the right model for the prediction request, read the appropriate statistics to make a prediction, and reply.

Finally, each Lift instance has a special test cluster called Loader. This cluster gives us the ability to simulate a population of users generating feedback and prediction requests, to load test the system. These tests are key to verifying the system can scale up (and scale down) depending on the level of user demand.
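A Loader worker boils down to a reproducible generator of synthetic traffic. A sketch, with illustrative request fields:

```python
import random

def synthetic_requests(n, users, seed=0):
    """Generate a reproducible stream of synthetic requests, as the Loader
    cluster does to stress-test a build (the fields are illustrative)."""
    rng = random.Random(seed)   # fixed seed -> repeatable load tests
    return [
        {"user": rng.randrange(users), "kind": rng.choice(["feedback", "predict"])}
        for _ in range(n)
    ]
```

Scaling the Loader cluster up multiplies this stream to arbitrary size; running it scaled down in production provides the small baseline of traffic used for end-to-end health checks.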

Decomposing our system this way has had many scalability benefits. By utilizing S3 and EFS for persistence, we can store unlimited amounts of data and avoid the storage limitations associated with most databases. Likewise, decomposing our service into train, infer, and enrich clusters makes it easy for each cluster to scale independently, depending on whatever resource (CPU, RAM, or Network) is most relevant to its work.

By scaling up our load cluster, we can simulate arbitrarily large populations of users and stress-test our builds before they ever reach customers. We can also confirm that our system scales down during less busy periods, so that we don’t pay for more servers than we need. Moreover, by running a scaled-down Loader in production, the cluster effectively doubles as an end-to-end test for our system, generating a small baseline of consistent traffic we can use to verify system health in real time.

Lift doesn’t specify exactly what procedures Enrich, Train, and Infer will perform. This is intentional, so that developers who need to build an ML service can simply “plug in” their implementations of these procedures and use our infrastructure creation tools to build their Lift instance, with supporting servers for CI/CD and monitoring.

DevOps and Lift

The ART team is not very large but we support a large number of servers. To do that, we’ve recently adopted some DevOps practices that are slightly different from the ones we used building CI/CD pipelines in the past.

Our implementation of Lift is 100% infrastructure as code, meaning we can create (or tear down) our entire service (including support servers) automatically, in a matter of hours. This has had several benefits.

  • Decreased contention for test resources: Now, developers can easily set up their own scaled-down version of production.
  • Testable infrastructure changes: We can now test ambitious infrastructure changes in our development accounts. Once a developer has a working change, they can be confident that change will yield working infrastructure in production.
  • Simplified security analysis: Since our environment is 100% infrastructure as code, most security analysis can be performed by inspecting our codebase rather than stateful AWS consoles. This makes it much easier to verify our security groups are set up properly.
  • Centralized Logging and Monitoring: It’s easy for developers to add an Elasticsearch cluster and Kibana instance to their test environment to collect instrumentation and debug problems. Our code adds a log transport daemon (Filebeat) to every cluster worker, so it’s easy to get instrumentation into the system. In production, that same instrumentation enables us to use Kibana as our primary tool for monitoring most aspects of system performance.

We also utilize immutable infrastructure techniques when we create our Enrich, Train, and Infer clusters. We call this system “blue-green deployment”. Whenever we have a new version to deploy, we stand up fresh clusters behind a private load balancer. We designate this collection of clusters a “green” version of our service, as opposed to the “blue” version exposed to customers on a public load balancer. We then apply a battery of acceptance tests (including load tests) to the green service. If they pass, the green service is promoted to blue, and we atomically cutover our public load balancer to point to our new clusters. The older version of the service is then retired, kept in reserve until our automation determines it is time to tear it down. This gives us one more defense against a service outage, because we can always fail back to the retired version of the system.
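The promotion decision at the heart of that process can be sketched as follows (the cluster names are placeholders):

```python
def promote(blue, green, tests_passed):
    """Blue/green cutover: if the green stack passes its acceptance tests it
    becomes the live (blue) service, and the old stack is kept in reserve
    for fail-back until automation tears it down."""
    if tests_passed:
        return {"live": green, "reserve": blue}
    return {"live": blue, "reserve": None}   # failed green stack is discarded
```

The atomic part of the real cutover is the load balancer repoint; this sketch only captures which stack ends up live and which is held in reserve.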


Blue-Green Deployment

This deployment design has had many benefits for our CI/CD Pipeline. Despite having more strenuous acceptance tests than ever before, we can now put a new version into production in a matter of hours instead of days, with no downtime for deployment and robust defenses against failure.


Building an online machine learning service in the cloud isn't easy, especially with a small team. Our Lift architecture, and the DevOps techniques that support it, have allowed us to greatly increase our scale while improving our security, developer productivity, and robustness to failure.

Solid-State Power Amplifiers vs. Traveling Wave Tube Amplifiers
Thu, 14 Dec 2017 00:45:03 +0000

Viasat is leading a new wave of communications satellite innovation. This is a by-product of our belief that there is always a better way, and of not being satisfied with the "state-of-the-art" industry capabilities. It demands we think differently as we develop next generation technologies to enable satellite systems that are demonstrably better than what has been offered in the past. Our unique industry position allows us to optimize the entire system, from the ground network to space payloads to user equipment. For high-throughput satellite payloads like ViaSat-1, ViaSat-2 and ViaSat-3, new technologies will need to be developed that get the industry beyond the status quo and allow for orders-of-magnitude improvement in capacity and coverage area. We'll discuss how new solid-state integrated circuit technologies are a tool that can be used to improve critical dimensions of performance for new satellite payloads.


An area where solid-state technology is changing the satellite industry is high power amplifiers (HPAs). These amplifiers are the critical part of satellite payloads that amplify the communications signal as it leaves the satellite so that, once the signal is received on earth, it is strong enough for ground equipment to detect and decode. For our geostationary satellites, that is a 35,700 kilometer journey. Solid-state power amplifiers (SSPAs) are one of the technologies that can make significant improvements in dimensions critical to satellite payloads.

For over half a century, Traveling Wave Tube Amplifiers (TWTAs) have been the workhorse of the satellite industry. The technology was first invented in WWII for high power RADAR transmitters and was capable of amplifying signals to very high output power levels (100s or 1000s of watts). This technology was eventually adapted for satellite communications. In more recent years, SSPA technologies, such as Gallium Arsenide and Gallium Nitride (GaN), have evolved to a point where they can more efficiently generate these types of power levels and have some benefits over the traditional TWTA solution. Our Viasat team has decades of experience developing these technologies for space, ground and airborne applications.

What’s Important

Most of the power on a communications satellite is devoted to the power amplifiers transmitting the desired signal back down to earth. The better an HPA performs this task, the better the system can perform. Because of this, there are several figures of merit that are important when we architect a satellite payload. In all these dimensions, SSPAs can be as good as or significantly better than a TWTA solution.

1. Size, weight and power (SWaP)
2. Linearity
3. Efficiency
4. Output distribution losses
5. Reliability
6. RF output power

Size, weight (or mass, since we are discussing a space application) and power are precious resources on our spacecraft and, because a typical high-throughput satellite will have many dozens of power amplifiers on a single satellite, there is a large multiplier effect. The larger each power amplifier is, the fewer of them can fit within the spacecraft. The more mass a power amplifier has, the fewer of them the spacecraft structure can carry and/or the larger the rocket needs to be to break it free from earth's gravity. The more power an amplifier consumes, the larger the solar arrays need to be to supply that power. In general, the smaller and lighter a power amplifier is, and the less power it uses to do the same job, the better. TWTAs are relatively large vacuum tube structures that require significant size and mass when compared to a few-millimeter-square solid-state integrated circuit that can be packaged in a module 10 to 100 times smaller. Additionally, there are the practicalities of interconnecting TWTAs or SSPAs to their antenna feeds. These interconnects can have significant size and mass impacts as well as loss that degrades efficiency.

Dual Redundant TWTA with linearizer.
Source: Tesat-Spacecom
High magnification image of a Viasat solid-state power amplifier integrated circuit. Many of these few mm2 devices are combined to reach TWTA power levels.

Linearity is a measure of how much an amplifier distorts the input signal as it amplifies it. All power amplifiers induce some level of distortion, and the closer they operate to their maximum output power limit, the more distortion they add. Because the advanced modulation techniques used in today's high speed communication links require low distortion, there are a few options. Oversizing the power amplifier allows it to operate with less distortion at the needed power level, but it incurs a size and power penalty for the larger amplifier. The other option is to correct the distorted signal with a linearizer device. In the case of TWTAs, a separate solid-state linearizer is used, which adds more size and mass to the TWTA solution, while an SSPA linearizer can be accomplished within the SSPA itself.

Efficiency is a measure of how much of the DC power an HPA consumes is converted to usable signal power that is sent to the ground vs power converted to waste heat that the spacecraft thermal system has to manage. This is one of the most critical performance parameters for a power amplifier and, prior to improvements in solid-state technology, this was the primary reason that TWTAs were not replaced with SSPAs. Previously, SSPAs based on Gallium Arsenide devices could not compete with TWTAs for efficiency or power density but newer GaN processes have changed the game. Our design teams continue to leverage the decades of experience we have building SSPAs to optimize their performance. GaN based SSPAs are now very competitive with TWTA efficiencies and future technologies such as Graphene promise even higher efficiencies and power densities.

When evaluating the impact to efficiency, we consider much more than just the TWTA or SSPA efficiency. Practical implementation details such as signal distribution after the power amplifier output are also a critical part of the overall system efficiency. These losses manifest themselves as precious signal power that is wasted as heat. In order to manage the TWTA high output power, size and mass is sacrificed to use low loss metal waveguide transmission lines to keep these distribution losses low as the amplifier is connected to the antenna feed. This low-loss solution drives more complexity as the routing becomes a nightmare of interleaved rigid waveguides necessary for the dozens and dozens of TWTAs within the system. Imagine hundreds of rigid waveguides criss-crossing within the payload as the signals are distributed from where the TWTAs have room to fit to where the antennas need to be on the outside of the spacecraft. By contrast, SSPAs have the enormous benefit of being small and compact enough that they can be placed out at the antenna feed and avoid distribution loss altogether.
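A quick back-of-the-envelope shows why those distribution losses matter. Loss in decibels translates to wasted signal power as follows (the 1 dB figure in the example is purely illustrative):

```python
def fraction_delivered(loss_db):
    """Fraction of the amplifier's output power that survives a post-PA
    distribution loss of `loss_db` decibels; the rest becomes waste heat."""
    return 10 ** (-loss_db / 10)

# Example: a 1 dB waveguide run delivers roughly 79% of the RF power,
# wasting about 21% as heat -- power the spacecraft paid dearly to generate.
```

This is why placing the SSPA at the antenna feed, and taking the distribution run out of the path entirely, is such a large system-level win.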

Reliability is a key design requirement for our spacecraft payload designs, since a significant investment is being made in an asset that needs to work for 15+ years with no opportunity for repair. TWTAs represent a single point of failure that can cause an entire coverage area of the system to disappear, unrecoverably. To overcome the single-point-failure concern, a redundant spare TWTA is required. Large, lossy waveguide switch or combiner networks are implemented to allow the spare TWTA to be used in case of a failure. While not being used, the spare contributes nothing to the performance of the system, but its size and mass have to be included in lieu of other components that could improve system performance. SSPAs, on the other hand, use power combining schemes that are inherently redundant by the fact that they combine many smaller amplifiers together to achieve the overall desired output. They exhibit graceful degradation as failures occur, resulting in only slightly degraded performance rather than full outages.

TWTA is not internally redundant.
Failure recovery requires a completely redundant TWTA.
SSPAs are internally redundant. One amplifier failure only results in a small performance degradation.
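That graceful degradation can be quantified with an idealized model of an N-way power combiner, where losing unit amplifiers scales the coherently combined output by (working/total) squared. This simplified model ignores combiner imbalance and real-world losses; it is a sketch, not a design equation:

```python
import math

def combined_output_loss_db(total, failed):
    """Output power degradation, in dB, of an idealized N-way combined SSPA
    with `failed` of `total` unit amplifiers dead. A TWTA is the total=1
    case, where a single failure is a full outage."""
    working = total - failed
    if working == 0:
        return float("inf")   # complete outage
    return -10 * math.log10((working / total) ** 2)
```

With 16 combined devices, for example, one failure costs only about 0.56 dB of output, versus a total outage for an unprotected TWTA.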


RF output power is a measure of how much a power amplifier can increase the radio signal. Payload system designs can vary significantly in HPA output power requirements, and depending on the frequency and power level, SSPAs can perform as well as, if not better than, TWTAs. SSPAs use power combining techniques to add the power of many smaller power transistors together, often on the integrated circuit itself. These power combining techniques have good and bad aspects. All power combining methods have inherent losses, so they degrade the system efficiency. For this reason, very high power applications are still often served by TWTAs, since the SSPA power combining losses become overwhelming. That being said, newer high-efficiency solid-state technologies such as GaN allow for higher power transistor building blocks, so fewer need to be combined for a given power level. This results in reduced output losses and higher system efficiency.

SSPA Challenges

Although SSPAs hold a lot of promise for changing the way we build and design satellites, they come with their own challenges. One of the primary concerns is thermal management. A key benefit of newer solid-state power amplifier technologies is their increasing power density, but this benefit comes with significant thermal challenges. The reliability of solid-state devices is directly related to their transistor device temperature, so a good thermal management design is critical. When you consider that all the power lost to inefficiency is concentrated in just a few square millimeters, good thermal management directly under the SSPA die is essential. Also, taking advantage of the ability to place the SSPA directly at the antenna feed complicates this further, since the thermal design has to extend out to the feed and maintain a safe and reliable operating temperature at the SSPA.


Solid-state technology has been changing our world for over 50 years. Why should the satellite industry be any different? Viasat continues to “raise the bar” of what customers should expect broadband satellite communications to be and solid-state technologies will be a key tool to meeting those expectations.

Orchestration for a Virtual Network
Mon, 09 Oct 2017 06:00:17 +0000

Our next-generation network is built mostly with virtual network functions. These are services that ViaSat has migrated from custom, purpose-built hardware to a virtual platform. One key component for constructing virtual networks is orchestration. It ties the deployment of the virtual network functions into the rest of the network using the network controller service that I wrote about in my last blog.

There are a lot of open source orchestration tools out there. ViaSat started with some of them, such as Heat on OpenStack, but most of the open-source components did not fit our needs at that time. We soon realized that this piece needed to be built in-house.

The top things we needed from an orchestrator were:

  • Ability to orchestrate real time applications – Most of our VNFs (Virtual Network Functions) have real time needs. They drive a satellite after all.
  • Ability to create a network with real and virtual network functions – We are retrofitting a running network. Like changing tires on a race car while it is being driven.
  • Ability to do hybrid clouds – We use public and private cloud-based functions.
  • Ability to build services once and deploy everywhere.
  • Scale services in and out.

Introducing Venom

We started with an orchestrator that we built for our labs and extended to support the service provider use cases.

The orchestrator is designed to sit in the public cloud, with agents that sit in the tenant's public or private cloud and connect back to it. That way the tenant's data stays under their control at all times while still leveraging a Software-as-a-Service orchestrator.

Service Templates

The basic building block of Venom is the service template. Expert users can create templates representing complex networks comprising real and virtual devices, and configure them. Other users can take these templates and instantiate isolated instances of them in private or public clouds.

An end-to-end service can be composed of several templates networked together in a hierarchical or serial fashion.

To handle physical devices, users create proxy functions in Venom. The proxy functions are translated to real devices once they are instantiated.
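
As an illustration (the names and fields here are hypothetical, not Venom's actual schema), a service template and its instantiation might look like:

```python
# A much-simplified service template: named functions (virtual, or
# "proxy" placeholders for physical gear) plus the links between them.
# Instantiating it stamps out an isolated copy in the chosen cloud.
template = {
    "name": "edge-pop",
    "functions": [
        {"name": "fw", "kind": "virtual", "image": "firewall:1.2"},
        {"name": "dpi", "kind": "proxy"},   # resolved to a real device later
    ],
    "links": [("fw", "dpi")],
}

def instantiate(template, cloud, inventory):
    """Return a concrete instance, binding proxy functions to real devices."""
    instance = {"template": template["name"], "cloud": cloud, "functions": {}}
    for fn in template["functions"]:
        if fn["kind"] == "proxy":
            instance["functions"][fn["name"]] = inventory[fn["name"]]
        else:
            instance["functions"][fn["name"]] = f'{cloud}/{fn["image"]}'
    return instance

# The same template, instantiated in an isolated lab against real hardware:
lab = instantiate(template, "private-lab", {"dpi": "rack7/dpi-appliance-03"})
```

The key property is that the identical template drives development, test, and production instantiations; only the cloud target and the proxy-to-device inventory change.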

Some of the benefits of designing things using these templates are:

  • The same template can be used in the continuous integration cycle all the way from development to test to deployment – this reduces bugs drastically.
  • People can share hardware in their own isolated labs with different topologies.
  • The same topology can be instantiated in different locations and across cloud boundaries.
  • We can have an “app-store” model where people publish tested microservice-based clusters. Other people can use these to create more complex services easily.

In the next post I will talk about monitoring in a virtual network environment. It is very interesting and hasn’t been solved completely.


Editor’s note:
This is the third article in our series on Network Virtualization.  Please check out Pawan’s previous articles:
    Virtual Service Provider Networks
    Building Blocks: Dynamic Service Chaining

Azimuth and the Story of Immediate Feedback
Tue, 19 Sep 2017 21:44:13 +0000

We would like to show you Azimuth, a new tool for exploring networks, built by interns at light speed. It’s like a command-line, except instead of the output being text, you get an interactive 3D graph.

How do you understand a network as it changes and grows by orders of magnitude?

Hand-drawn maps don’t scale, and machine-drawn maps show too much information. We took a stab at immersive 3D and VR, but found that it didn’t really simplify anything because we were just pushing complexity into the z-axis.

After a lot of trial and error, we discovered that having a command-line tied directly into the running application not only sped up development (which is where we got the idea), but also provided a powerful interface for users to dominate the complexity.

Instead of trying to simplify everything imaginable for users a priori, the key is to give users tools that provide significant power coupled with immediate visual feedback (context), so they can do it themselves.

So, let’s check it out.

What can a command-line view of the network do?

Let’s say you just joined the team and need to update some backend “Starscream” servers. For starters, where are they even located and what do they depend on? Normally we’d have to read a bunch of code and hunt around to figure this out, but here we can just use a search command.

The Azimuth command-line is just a JavaScript REPL, so we get direct access into the running system and a set of functions to interact with it. Here we can just select anything tagged with the name “Starscream” and start exploring from there.

Now that we found the security groups for the Starscream servers, we can see how they are connected and get more details from the information cards in the panel on the right.


Next, we need to make a few changes to some web servers that host services over HTTPS, so we can again use the command-line to only show port 443 connections and focus on the task at hand.


Finally, since some of these services are load balanced, we can toggle into 3D mode and check it out.

Although most layout is done in 2D, the z-axis does come in handy. Here we can see the load balancers with all of their instances stacked on top of them.

Internally we’re starting to use the tool for keeping up with network changes, security checks and visually debugging connectivity issues. Next up, we’re planning to add cost tracking as well as usage and traffic information, plus a time machine so it’s possible to go back and watch changes to the network over time.

Combinatorial Power

We’ll get into details on how we built Azimuth in another post. For now, we want to describe how we designed the system around function composition to exploit Metcalfe’s law.

Metcalfe’s law is usually stated in terms of telecommunications networks, where the value of the network grows on the order of O(n²) in the number of nodes, since the number of possible connections between them grows quadratically. However, this also carries over to functional programming and system design.

By building around a set of functions that operate on a common data structure, every new function we include multiplicatively increases the value of the existing system, since each new feature can be used with all the existing ones.

For example, by adding a new findByTraffic() function, we can use it with the existing functions to select all the links with port 443 traffic that are over one megabyte.

select(findByPort(443), findByTraffic(1000))

As a software engineer, it’s fun to have a quadratic O(n²) term actually working in our favor for once. But besides just using this to simplify internal designs, we’ve found that carrying this power onward to end users by exposing it through a command-line interface, combined with immediate visual feedback, is pretty compelling.
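
The same compositional idea can be sketched in a few lines of Python (Azimuth's actual command-line is a JavaScript REPL; the helper names below simply mirror the example above):

```python
# Each find* helper returns a predicate over a common link record, and
# select keeps the records matching every predicate.  Any new helper
# composes with all existing ones for free -- the Metcalfe-style payoff.
def find_by_port(port):
    return lambda link: link["port"] == port

def find_by_traffic(min_bytes):
    return lambda link: link["bytes"] >= min_bytes

def select(links, *predicates):
    return [l for l in links if all(p(l) for p in predicates)]

links = [
    {"port": 443, "bytes": 2_000_000},
    {"port": 443, "bytes": 500},
    {"port": 80,  "bytes": 9_000_000},
]
https_heavy = select(links, find_by_port(443), find_by_traffic(1_000_000))
# -> [{"port": 443, "bytes": 2000000}]
```

Because every helper speaks the same data structure, adding a tenth helper makes all nine existing ones more useful, not just itself.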

The importance of immediate feedback

Programming is a process of discovering exactly what we need to know in order to solve the challenges we face. The more we can shorten this build, break, learn feedback loop, the faster we can converge on the understanding necessary to actually solve those challenges.

When we started the project, we didn’t know anything except that we wanted a system for mapping the network. Everything else we would have to figure out along the way. By building a system around quick iteration (to minimize the penalty of missteps), we can maximize learning and apply what we’ve learned, since we need to quickly discover what we don’t even know.

From the beginning, we built around the idea that we should be able to do an entire day’s work without restarting the server or ideally even refreshing the browser. When we save changes to our code, they should be immediately hot-swapped into the running application, so we can experiment and adapt as fast as possible.

It may seem eccentric, but this isn’t just about speeding up the workflow, this is about writing code that is easier to modify and change.

Towards better code

Good code is changeable because successful projects will always need to change. If code is so malleable that it can survive being altered while running, then it’s arguably pretty good stuff. Going further, we’ve seen that code written this way is also easier to refactor and update with new features.

“[Choice] is the only tool we have which allows us to go from who we are today to who we want to be tomorrow.” – Sheena Iyengar

Writing easily modifiable code is all about preserving choice, protecting the amount of options we have to incorporate what we learn tomorrow, into the code that we wrote today, and not getting trapped by the uninformed choices we made yesterday.

All in one wild internship

Hope this gives you a taste of what our interns built this summer and the underlying philosophy. Interns Stephen White, Sean Lee, and Robert Henning had a blast creating it.

We’ll be back with more details and code samples on how to build an immediate feedback system next time.

Also, if you’re looking for a radical internship and learning everything from 3D graphics, to functional programming, UX, networking and surfing (we’re near the beach) — all in one summer, give us a shout.

Check out: ViaSat Internships


Using Artificial Neural Networks to Analyze Trends of a Global Aircraft Fleet
Wed, 05 Jul 2017 20:20:59 +0000

I’m an intern in the Commercial Mobility group at ViaSat. Our group is responsible for all of the company’s commercial aviation clients, providing internet services to aircraft. While providing the world’s best in-flight internet service to airplanes traveling over 500 miles per hour, 30,000 feet above the ground, is no small feat, it is also a challenge to analyze and predict user demand on our network. There are typically several hundred planes connected to ViaSat’s network at any given time, amounting to 15,000-40,000 flights a week depending on the season. With this much range and traffic, and flights leaving at all times of day all over the world, modeling anything about them becomes very difficult.

A Brief Recap of Machine Learning

Machine learning has recently risen as a popular buzzword across industry and academia, covering a wide range of fields and procedures that use computer algorithms to process and meaningfully interpret real-world data. What sets machine learning apart from traditional methods is, well, the learning. Instead of a developer programming every possible scenario, machine learning algorithms are trained on known data and left to interpret new data based on what they have been taught.

What is an Artificial Neural Network?

Artificial neural networks (ANNs) are machine learning models inspired by synapses in the brain. They are constructed by connecting layers of nodes with a dense web of links, bordered by an input and output layer. The internal layers are referred to as hidden layers because anyone calling the model will not see or interact with them. They are hidden behind the interface of the input and output layers.

Mathematically, we represent each node layer as a function and each link as a weight. The web of links between each layer can then be represented as a matrix. This matrix architecture has several advantages.

First, it allows the model to simultaneously process multiple sets of inputs and return a set of outputs for each. Second, the entire network can be summed up in one simple equation, applying each layer's weight matrix W_i and activation function f in turn:

    output = f(W_n · f(W_{n-1} · ... f(W_1 · input) ... ))

However simple this equation is, it can still require massive amounts of processing to evaluate. ANNs can have anywhere from a few to several hundred layers, with anywhere from a dozen to several thousand nodes each.

Models with a high number of layers are called Deep Belief Networks and give rise to the term “Deep Learning.”

To evaluate these models we use video graphics cards. Surprised? You shouldn’t be. All the 3D graphics you see in video games use the same types of matrix math we harness in ANNs. Commercial gaming cards are built and optimized to compute huge matrices very quickly, which makes them great for both gaming and running neural networks. If you’re interested in just how all this works, there are lots of resources out there. Lumiverse has a particularly good video series on the inner workings of ANNs.
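
The layer-by-layer matrix math described above can be sketched in a few lines of NumPy (shapes and random weights are purely illustrative):

```python
import numpy as np

def forward(x, layers):
    """Feed a batch of inputs through the network: out = f(out @ W + b) per layer."""
    out = x
    for w, b in layers:
        out = np.tanh(out @ w + b)     # tanh as the activation function f
    return out

rng = np.random.default_rng(0)
# 4 input features -> 8 hidden nodes -> 2 outputs; one (W, b) pair per layer.
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),
          (rng.normal(size=(8, 2)), np.zeros(2))]
batch = rng.normal(size=(32, 4))       # 32 input sets processed simultaneously
predictions = forward(batch, layers)   # shape (32, 2)
```

Because each layer is a single matrix multiply over the whole batch, this is exactly the kind of workload a graphics card chews through in parallel.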


As mentioned in the introduction above, the Commercial Mobility group at ViaSat provides the world’s best in-flight internet service to commercial aircraft. With hundreds of planes connected to ViaSat’s network around the world at any given time, and with seasonal spikes in air travel, modeling the network becomes very difficult.

Fortunately, this kind of high-dimensional, non-linear problem is just what neural networks excel at. Using public data such as departure and arrival times, airline, and the origin and destination of flights, coupled with ViaSat’s internal metrics for flight usage we trained a neural network to predict the number of devices and the satellite link data use for all aircraft. These predictions are compared to the actual data at one hour intervals in the plots below.

Personal Electronic Device Count

Forward Link Data Use

Return Link Data Use

While these models are not perfect, they are substantially more accurate than the statistical models previously in use. The best part is, these models can continually be re-trained with more current data, keeping up with changing trends in user demand as our services grow and more airlines switch to ViaSat’s internet services. We can also always go back and add more input nodes, with more metrics to improve our understanding of when and why people use the internet while they fly.

Moving Forward

For those of us in Commercial Mobility, this is just the tip of the iceberg. Artificial neural networks provide us a whole new way to examine data and solve problems. We can use ANNs to identify common factors driving problems that could cause service outages and loss of flight connectivity. With this kind of insight, we can remedy edge-case scenarios that could otherwise cause failures, and we can take measures to prevent problems from reaching customers by predicting them far enough in advance. It’s a big complicated world out there and machine learning is giving us here at ViaSat a whole new perspective.

Editor’s Note:  This article was written by one of our interns describing his project for the summer.  ViaSat is proud to give our interns challenging, real, and valuable projects to work on during their internship. Keep an eye out for more posts from our amazing interns!  All posts will be tagged with the “Intern Project” label.

Building Blocks: Dynamic Service Chaining
Wed, 14 Jun 2017 19:12:45 +0000

As a service provider, our network has to handle millions of devices and millions of flows. With our in-flight internet service, these flows are moving; they have to access critical services in different datacenters across the network. That makes the network quite unique and dynamic.

As I wrote in my last blog, our network is comprised of many kinds of elements:

  • Traditional metal routing/switching boxes
    • Capable of moving around a large number of bits.
  • Programmable switches
    • We use APIs to setup our underlay network as a service.
  • Direct connect and VPN tunnels
    • Ability to reach resources in the public cloud
  • Virtual network functions
    • We run our network functions as services. They can run either in the public cloud or in our own data center
  • Firewalls
    • A service comprising virtual and metal-based firewalls
  • Legacy metal-based network functions
    • DPI engines, access/MAC and PHY layer for the older satellite, traffic conditioners

One of the first building blocks we used to start building our next generation service provider network was dynamic service chaining. We called this service “vForwarder”.

Traditional Core Networks

In traditional core networks, traffic flows in a very well-defined and controlled path.

All traffic flows through all of the network devices that are pre-wired together. This has certain negative implications:

  • The network functions that carry the traffic have to scale at the same time.
  • Every time there is a network function that needs to be added, even for a small number of subscribers, the paths need to be changed.
  • A fault in one network function can impact the whole network path.
  • All traffic needs to traverse most of the network devices even if they don’t need to act on the traffic.


Dynamic Service Chaining Approach

When we started building out the next-generation network, we decided to take a microservices approach to address the challenges of traditional core networks.


The service allows network flows to follow different paths through the network. It allows flows coming from the same device, even the same app, to go through distinct sets of network functions.

The high level goals are:

  • Allow the network to behave like a set of connected microservices.
  • Allow the addition of new network services dynamically without impacting other services.
  • Let network functions scale independently of each other.
  • Bypass or route around faulty network functions.
  • Allow integrating legacy network paths and equipment.


ViaSat Virtual Network

We created a completely distributed network service. It includes hardware from well-known routing/switching companies and a whole lot of home grown software.

The network controller is home grown and runs in each datacenter and also in the public cloud. It scales to millions of devices with thousands of flows per device.

From each vForwarder’s perspective, it sees a set of Network Function interfaces (NFis) and a set of overlay tunnels. The network functions can be virtual or physical.

The network functions can use the controller’s API to attract traffic flows to their own NFis. They can also specify whether an NFi needs a certain encapsulation – VLAN, GRE, VxLAN, etc.

The overlay interfaces can have a completely different encapsulation on each link. This allows fully mesh-connected network functions within and across data centers. Even functions in the public cloud can be connected in a chain easily.

The distributed network controllers allow a service chain to span data centers and network functions can be placed anywhere in the network. The controller also programs the overlay gateways in the border leaf routing devices.



Using our network controller and the vForwarder service, we created a platform with the following benefits:

  • New functions can be added very easily into the network – neighboring services don’t even need to know.
  • Network functions can scale out independently of each other, since traffic is selectively sent only to the functions that need it.
  • We can move flows and combine service chains across data centers, which allows for a truly mobile and dynamic end-user service.
  • Traffic can be routed or re-routed on a hop-by-hop basis, so if a network function changes or remarks traffic, the next hop is truly dynamic.
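
A toy sketch of the idea (the names and the classification rule are hypothetical, not the real vForwarder API): the controller keeps an ordered chain per flow class and simply routes around any function currently marked unhealthy:

```python
# Ordered service chains per flow class; flows only traverse the
# functions their class actually needs.
CHAINS = {
    "web":     ["firewall", "dpi", "web-accelerator"],
    "default": ["firewall", "traffic-conditioner"],
}

def classify(flow):
    """Hypothetical classification: HTTP(S) flows get the web chain."""
    return "web" if flow["dst_port"] in (80, 443) else "default"

def service_path(flow, healthy):
    """Ordered network functions this flow should traverse right now."""
    return [nf for nf in CHAINS[classify(flow)] if nf in healthy]

healthy = {"firewall", "web-accelerator", "traffic-conditioner"}  # dpi is down
path = service_path({"dst_port": 443}, healthy)
# -> ["firewall", "web-accelerator"]  (the faulty DPI hop is bypassed)
```

Two flows from the same device can follow entirely different chains, and a failed function is dropped from the path without touching its neighbors, which is the heart of the goals listed above.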


Next I will talk a little more about Network Function Orchestration in this environment. Please watch this space.


VWA’s Journey to DevOps
Wed, 17 May 2017 03:40:38 +0000

ViaSat Web Acceleration (VWA) is a product that provides terrestrial-like web performance for our Exede satellite internet customers. It is a merging and evolution of two ViaSat products: iPEP, a TCP accelerator, and AcceleNet, a web accelerator. Development of VWA started about four years ago as a project in our Acceleration and Research Technology (ART) center, based in Boston. More importantly, it was ART’s first project to experiment with continuous integration (CI). We used a tool called Buildbot, which is what the open-source Chromium browser uses to orchestrate CI builds and unit tests. The team quickly realized the benefits of CI, and we never turned back. At the same time, we also started working on an automation framework that would allow us to do automated installs of VWA and automated tests in our lab. We scheduled these tests to run on a regular basis from Buildbot, and the first iteration of a VWA pipeline was born.

Two years ago, we started focusing a lot more of our efforts on CI/CD.  We started using GoCD as the pipeline orchestration tool primarily because of its fan-in fan-out feature, which fits perfectly with our pipeline design.

We collapsed all project code branches into one main branch so we could manage the pipeline more easily.  We reduced our build time 8-fold in order to give developers fast feedback.  We started working on push button deployments using Ansible, with the vision that the same set of deployment scripts would be used for deployments into lab environments as well as production environments.  This allows us to test our deployments through the pipeline at all times.  We architected the pipeline to include multiple lab environments, with each lab environment running different sets of tests based on the infrastructure, in order to gain broader test coverage.

VWA Deployment Pipeline

However, we eventually learned some important lessons about the pipeline. First, getting to a stable pipeline takes time; in fact, our team is still fixing pipeline issues on a daily basis. Most of the issues are related to unstable labs, which brings me to the second point: running tests through labs that are not built and supported for CI/CD drains a tremendous amount of resources. Our developers spent many hours debugging lab issues, instead of fixing actual bugs, because labs were either unstable or had too many moving parts. To mitigate this, we have started migrating tests from less stable labs to labs that are dedicated to the pipeline.

A year ago VWA went live on Exede with Alpha.  We were one of the first apps teams to deploy into Exede using a CI/CD/DevOps model.  Needless to say it was a major culture shift for both the VWA development team and our Exede service operations team in Denver.  Working together, we created a “Fast Pass” that allowed us to deploy new versions of VWA into Exede without going through the usual formal software release process.  Back in the AcceleNet days, we would release our software to Ops and we would be done.  But with DevOps, the VWA DevOps team decides what to release, when to release, and how to release.  There were no more gates imposed by the Ops team.  It also means that the VWA team is more accountable than ever to deliver quality software to customers quickly.  Every deployment is automated with a push of a button.  Our goal is to make releases and deployments boring because they are repeatable and there are no surprises.  We are not completely there yet but we have made great strides towards the goal.  We spent a lot of effort making sure deployments are truly push button and are resilient to external conditions such as intermittent network issues.  We have a defined promotion policy for every release that we strictly follow to ensure quality software is being released.  We have done over 200 deployments into production so far, and that is a major accomplishment by the team.


Before and after:

  • build: > 2 hr → 15 min
  • release testing: weeks (mostly manual) → 24 hours (automated)
  • deploy to one production server: hours (mostly manual, requires making a new image) → 10 min for a single-node cluster, 3 hr for an 18-node cluster, all automated
  • release to production: weeks → 3 days (1 day on alpha, 1 day on beta; the rest of production can be deployed in 1 day)

In order to support VWA end-to-end, we focused on making sure that the VWA team has the ability to monitor the production network, since it was the first time we needed high visibility into the production environment.  We built an internal monitoring and alerting system that involved adding SNMP traps to VWA to signal critical issues, which would get funneled into Splunk and then trigger notifications to the team.  As we continued to roll out VWA to the rest of the production network, we realized that just having an internal monitoring system was not sufficient.  We failed to detect issues in production because VWA could get into a bad state where alerts were not being sent.  We quickly pivoted and implemented an external monitoring and alerting system called Cluster Doctor that provides us reliable and accurate data about the state of VWA.  It monitors the health of all VWA clusters by periodically polling each node looking for anomalies.  Together with xMatters integration and a 24×7 on call team, we are able to quickly detect and address critical issues that could be customer-impacting.
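
The external-poller idea can be sketched in a few lines (the node names, health check, and alert hook below are all hypothetical placeholders, not Cluster Doctor's actual interface):

```python
# A stripped-down sketch of external health polling in the spirit of
# Cluster Doctor: poll every node from the outside, flag anomalies, and
# alert on them -- so a node too broken to send its own SNMP trap still
# gets noticed.
def poll_cluster(nodes, check, alert):
    """Run the health check against every node; alert on each anomaly."""
    unhealthy = []
    for node in nodes:
        status = check(node)
        if status != "ok":
            unhealthy.append((node, status))
            alert(f"{node}: {status}")
    return unhealthy

alerts = []
statuses = {"vwa-01": "ok", "vwa-02": "crash-loop", "vwa-03": "ok"}
bad = poll_cluster(statuses, statuses.get, alerts.append)
# -> [("vwa-02", "crash-loop")]
```

The design point is that the poller's view of node health comes from outside the monitored system, so it keeps working even when the node's internal alerting is the thing that broke.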

In addition to monitoring for critical issues, we also focused on making tools to help us find regression issues from build to build so we can make informed decisions about whether to roll out a build further into the network.  We built a pipeline dashboard that gives us the ability to quickly identify a release candidate.

VWA Deployment Pipeline Status

We built Grafana dashboards that allow the team to see the performance of VWA across the production network in real time.

VWA Grafana Dashboard

We are tracking crash rates and Mean Time Between Failure (MTBF) for each build.

VWA Mean Time Between Failure Graphs

We are processing crash reports in bulk and categorizing them with automation that allows the team to prioritize working on the high impact issues first.  The team is more hands-on than ever before and it has led us to find unexpected issues as well.  For example, we identified from processing VWA crashes that there are many fielded terminals that needed to be replaced.

Being on call 24×7 is a major culture change for the team. Fortunately, xMatters made the transition less dramatic. We have four on-call teams, each with three members, covering around the clock. Teams alternate between weekdays and weekends; for example, team 1 is on call on weekdays during week 1, then on the weekend during week 3.

VWA On Call Rotation

Rotation also happens within each team of 3.  Primary responder becomes the last responder the next day.

VWA Team Member Rotation
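
One possible way to encode that rotation (four teams, with the weekend shift trailing the weekday shift by two weeks; this is our reading of the schedule above, purely illustrative):

```python
# Which of the four on-call teams covers a given ISO week.  The weekend
# shift trails the weekday shift by two weeks, so team 1 works weekdays
# in week 1 and the weekend in week 3.
def weekday_team(week, n_teams=4):
    return (week - 1) % n_teams + 1

def weekend_team(week, n_teams=4):
    return (week - 1 - 2) % n_teams + 1

print(weekday_team(1))   # team 1 covers weekdays in week 1 ...
print(weekend_team(3))   # ... and the weekend two weeks later
```

Encoding the schedule as a function rather than a wall chart also makes it trivial to feed into an alerting tool's escalation config.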

The challenge is to get the number of false alarms down so the team does not experience battle fatigue. There have been many instances where the monitoring system signaled an issue and the on-call team got notified, only to find out that the issue was not related to VWA and there was nothing we could have done. Some examples are neighboring subsystems being down, maintenance events, and weather outages. We are constantly keeping an eye on team burnout and the effectiveness of escalations by keeping an on-call journal for every incident, tracking metrics such as the number of escalations and the percentage of false alarms, performing regular retrospectives, and identifying and prioritizing areas of improvement.

We learned that in order to do DevOps effectively, communication between the development team and the service operations team must go both ways. For example, when the VWA team gets an escalation and plans to perform system recovery (e.g., resetting a node), the plan needs to be communicated to make sure it will not run into conflicts. On the other hand, timely and reliable information on neighboring network conditions that could be causing VWA alerts should be communicated by the ops team.

CI/CD/DevOps is a continuous improvement process and the team is constantly adapting to new processes and new ways of thinking as we take our lessons learned along the way.  I am confident that we will take the lessons learned from VWA and apply them to the rest of the organization as we transition into DevOps.


Virtual Service Provider Networks
Thu, 27 Apr 2017 01:15:21 +0000

The service provider network is the essential pipe that delivers connectivity to enable innovation. It has the potential to limit or expand the innovation that the people and applications riding on it can deliver. With the advent of new apps, games, and devices, the demands on the network change almost every day. The traditional service provider network, which took months (if not years) to evolve, is not able to keep up with the demand.

At the same time, ViaSat is launching next-generation satellites that have capabilities to cover the whole planet with high speed internet. In order to be more agile, we at ViaSat embarked on a journey to create the next generation network that supports a worldwide and ever-changing footprint.

Existing technology

We started by assessing our existing technology to see if it would fit the bill. Our main goals were:

  • Stability: We are building a service provider network; it’s an essential utility.
  • Hybrid: The world the network lives in now is not new. There is an existing internet and network out there and we have to co-exist for a long time.
  • Agile: We need the ability to change things at enterprise speed in a service provider world.

There were many options, but none fit our needs perfectly. We ended up with a solution that is truly hybrid.

Traditional Datacenter

The picture above shows a traditional data center design. The content providers are inside the network. The consumers need to reach into the data center, get to the content and retrieve it.

The number of endpoints that serve content is in the thousands.

Service Provider Datacenter

The needs of a service provider (SP) data center are very different. The SP datacenter is meant to be transit; in fact, if it is done correctly, it should be invisible to end users. They shouldn’t even know of the existence of most of the apps and services in the data center.

The demands on this type of data center are drastically different:

  • It needs to handle millions of endpoints with hundreds of millions of sessions
  • Security is per session and not per tenant
  • The network functions are virtual and/or physical depending on the function and vendor
  • The network needs to co-exist with an actual running service provider network made up of metal gear
  • An IP packet traverses several different network functions and their associated service points before it leaves the data center
  • Some of the applications, especially the MAC layer, are jitter sensitive and we need near real time performance
  • The network interfaces of different functions need routing or switching at both Layer 2 and Layer 3
  • Traceability and monitoring are big challenges; there is a lot of east-west traffic between virtual entities


The Solution

We came up with a very unique solution to this problem.

Virtual Service Provider Network

We took most of the network functions and virtualized them. The functions that could not be virtualized, we created proxies for. Once the network functions were virtualized, we created well-defined microservices out of them.

Then, we migrated some of the traditional control and management plane services into the public cloud. The data plane is in ViaSat’s own private cloud as it needs to interact with existing networks and carry a lot of transit traffic.

This infrastructure can be used by other service providers in the future. It is purpose-built for a service provider network, providing the stability and agility we needed.

We also created a whole set of new orchestration, dynamic service chaining, firewall, and IaaS infrastructure.  These will be discussed in future articles.

If you want to work with us to build this exciting new service provider cloud, please check our open positions!
