Obtaining and Using Real-time, Normalized Raw Data

It is our great pleasure to introduce Diane Strutner as a guest on the Zype blog. Diane is the CEO and co-founder of Datazoom and co-founder of WomenInStreamingMedia.org. Datazoom's video data control platform aligns streaming technologies with raw, standardized data.

Diane takes a deep dive into the important role of data throughout the VidOps lifecycle, and her insights can help any VidOps team make smarter decisions.

Recently, Zype published an incredible white paper on VidOps, a helpful framework for aligning the teams, tools, and processes which support streaming video operations. As the white paper states, data sits at the heart of any VidOps implementation. Here, we’ll flesh that idea out in greater detail, and make the case for why we must place a higher value on getting better data for the best VidOps implementations.

Trouble gathering data to detect problems.

Whether you’re live streaming or providing VOD content, at the end of the day the VidOps team needs to have the best information at their disposal which describes the state of streaming. This means the unbiased, unaltered truth which comes in the form of raw data. Because it is so important, data leveraged for VidOps must possess a few key traits - rawness, standardization, and speed.

The rawness of data refers to the degree to which data is free from processing. Too often, data and metrics are treated as synonyms, but the two are worlds apart. Metrics are subjective interpretations of objective raw data. One could compare the same metric computed by two analytics platforms and observe wild differences between the two. Discrepancies often arise from different “secret sauce” formulas, which frequently include client-side “pre-processed” metrics that bias the final, visualized metrics. Raw data, on the other hand, like a buffer event (vs. the metric “buffer ratio”), can always be traced back to a single and repeatable measurement or event.
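
To make the distinction concrete, here is a minimal sketch in Python. Both formulas and all names (`BufferEvent`, `buffer_ratio_vs_watch_time`, `buffer_ratio_vs_session_time`) are hypothetical illustrations, not the formulas of any real analytics platform. The point is only that the same raw buffer events can yield different "buffer ratio" values depending on the formula chosen.

```python
from dataclasses import dataclass


@dataclass
class BufferEvent:
    """A raw, repeatable measurement: one stall during playback."""
    start_s: float     # seconds into the session when the stall began
    duration_s: float  # how long the stall lasted, in seconds


def buffer_ratio_vs_watch_time(events, watch_time_s):
    """Hypothetical formula A: stall time divided by time spent watching."""
    stalled = sum(e.duration_s for e in events)
    return stalled / watch_time_s


def buffer_ratio_vs_session_time(events, watch_time_s):
    """Hypothetical formula B: stall time divided by watching plus stalling."""
    stalled = sum(e.duration_s for e in events)
    return stalled / (watch_time_s + stalled)


# The same raw events feed both formulas.
events = [BufferEvent(start_s=12.0, duration_s=2.0),
          BufferEvent(start_s=95.0, duration_s=3.0)]
watch_time_s = 95.0

ratio_a = buffer_ratio_vs_watch_time(events, watch_time_s)
ratio_b = buffer_ratio_vs_session_time(events, watch_time_s)
print(f"formula A: {ratio_a:.4f}, formula B: {ratio_b:.4f}")
```

The two results differ even though the underlying events are identical, which is why keeping the raw events lets you recompute or audit any metric later.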

Raw data which is not normalized against a single standard is difficult to manage and harder to analyze. When data, such as a ‘play’ event, is pulled from various platforms like iOS, Android or Roku, the data nomenclature is different. This means that when it comes to analyzing that data, instead of writing a query with a single variable to represent ‘play’, it could mean needing 20+ variables to capture all ‘play’ events across all platforms. Skipping the normalization process makes analysis more complex and leaves more room for errors.
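
As a rough illustration of what normalization buys you, the Python sketch below maps platform-specific event names onto one canonical name. The platform event names (`playbackStarted`, `onPlay`, `PLAY_START`) and the `EVENT_ALIASES` table are invented for this example; they are not taken from any real SDK or from Datazoom's Data Dictionary.

```python
# Hypothetical mapping from (platform, platform-specific name) to a canonical name.
EVENT_ALIASES = {
    ("ios", "playbackStarted"): "play",
    ("android", "onPlay"): "play",
    ("roku", "PLAY_START"): "play",
}


def normalize(event: dict) -> dict:
    """Return a copy of the event with a canonical name; keep the raw name for tracing."""
    key = (event["platform"], event["name"])
    canonical = EVENT_ALIASES.get(key, "unknown")
    return {**event, "name": canonical, "raw_name": event["name"]}


raw_events = [
    {"platform": "ios", "name": "playbackStarted"},
    {"platform": "android", "name": "onPlay"},
    {"platform": "roku", "name": "PLAY_START"},
    {"platform": "roku", "name": "SOMETHING_ELSE"},
]

normalized = [normalize(e) for e in raw_events]

# After normalization, a single condition captures 'play' on every platform.
play_count = sum(1 for e in normalized if e["name"] == "play")
print(play_count)
```

Keeping `raw_name` alongside the canonical name is a design choice made here so that a normalized record can still be traced back to the original platform event.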

However, with the vast number of data points that can be captured from each platform, normalizing data is no small task. It is detailed and precise work to pair key-value data according to a single standard. Even more time consuming can be the transformation of data to meet standards set by other systems. To get you started, you may consider using Datazoom’s Data Dictionary, an example of how to normalize video data.

Raw data should be available in real-time to maximize value. Even if your data is raw, normalized, and rich in context, a delay in the accessibility of the data, even just minutes post-collection, can cost it much of its value. Since viewers’ tolerance for poor streaming experiences continues to wane, the time to make impactful changes might be a few seconds at most. Therefore the systems that collect and normalize data should be assessed not only for accuracy but also for the amount of latency added. To maintain the most value of data, processing times should target under a second - a.k.a. true real-time.
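
One simple way to assess the latency a pipeline adds is to compare when an event was collected with when it became available downstream. The Python sketch below assumes each sample carries two timestamps in seconds; the function names, the sample values, and the one-second budget parameter are illustrative assumptions, not part of any specific product.

```python
def pipeline_latency_s(collected_at: float, available_at: float) -> float:
    """Seconds between collection and availability for one event."""
    return available_at - collected_at


def share_within_budget(samples, budget_s: float = 1.0) -> float:
    """Fraction of events that became available within the latency budget."""
    latencies = [pipeline_latency_s(c, a) for c, a in samples]
    within = sum(1 for latency in latencies if latency <= budget_s)
    return within / len(latencies)


# Hypothetical (collected_at, available_at) pairs, in seconds.
samples = [(100.0, 100.2), (100.5, 100.9), (101.0, 103.5)]

share = share_within_budget(samples, budget_s=1.0)
print(f"{share:.0%} of events met the sub-second target")
```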

Difficulties understanding the context for which problems exist.

Knowing that a problem exists is part of the challenge; however, today this is also where the usefulness of most systems ends. Metrics provided by analytics, even with the best data, only declare that a problem exists, while the root cause of the issue can remain difficult to determine.

Consider this: For the vast majority of distributors who do not own end-to-end infrastructure, a variety of third-party vendors complete the task of delivery. Those vendors, like CDNs, contract with other vendors, like transit providers. So when we are alerted to a problem, such as high buffering affecting a stream, how do we go about identifying the root cause of that issue and therefore what actions ought to be taken?

Understanding the root cause of an issue requires a thorough grasp of context. If we use buffering as an example, it would mean looking at data not only from the video player but also from the application layer, CDN logs, transit provider availability, and ISP throughput, and even aligning data from content preparation (encoding/transcoding and packaging). Basically, at a raw data level, we need to enhance the data and join additional context to it before we can perform a proper analysis. If the typical response to buffering is to switch CDNs, it does us no good if we later find out the issue actually resided downstream with the ISP. Data without context can lead to incorrect conclusions.
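
The idea of joining context to raw data can be sketched as follows in Python. Everything here is hypothetical: the session IDs, the `cdn-a`/`isp-x` labels, and the assumption that per-session CDN and ISP context is already available in a lookup table. Real pipelines would draw this context from CDN logs, network data, and other sources.

```python
from collections import Counter

# Hypothetical per-session context gathered from other layers of the delivery chain.
session_context = {
    "s1": {"cdn": "cdn-a", "isp": "isp-x"},
    "s2": {"cdn": "cdn-b", "isp": "isp-x"},
    "s3": {"cdn": "cdn-a", "isp": "isp-x"},
    "s4": {"cdn": "cdn-b", "isp": "isp-y"},
}

# Raw buffer events from the player know only which session they belong to.
buffer_events = [{"session_id": "s1"}, {"session_id": "s2"}, {"session_id": "s3"}]


def enrich(event: dict, context: dict) -> dict:
    """Attach CDN and ISP context to a raw player event, if any is known."""
    return {**event, **context.get(event["session_id"], {})}


enriched = [enrich(e, session_context) for e in buffer_events]

# Grouping by each dimension shows where the stalls concentrate.
by_cdn = Counter(e.get("cdn", "unknown") for e in enriched)
by_isp = Counter(e.get("isp", "unknown") for e in enriched)
print("by CDN:", dict(by_cdn))
print("by ISP:", dict(by_isp))
```

In this toy data the stalls are spread across both CDNs but all fall on one ISP, which would argue against switching CDNs as the response. A real analysis would also need to compare against the total number of sessions per CDN and ISP before drawing a conclusion.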

The biggest challenge is the ability to resolve problems “on time.”

One of the unique qualities of video streaming is how a multitude of systems must come together in real-time for each view. Studies show that viewers have a low tolerance for quality issues in video, such as buffering and slow video start times, compared with other internet content. Even if we can identify these issues, the biggest challenge remains how to make changes which create an impact. When it comes to streaming, in order to address issues in time, we must have the ability to make changes in real-time. In other words, we must leverage machine-driven automated processes to consistently “complete the loop” of systems identifying and resolving issues.

Establishing a pre-programmed response to changes allows for issues to consistently be addressed. Issues which stem from a single root-cause failure will have the simplest response. As an example, if an ad server returns an error code for a malformed URL, there’s likely only one change to be made - replacing the malformed URL with a new one. In this case, if we’re always able to replace a malformed ad URL before a user notices and abandons the stream, we reduce churn while generating more revenue.
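
A pre-programmed response of this kind can be as small as a single rule. The Python sketch below is a simplified illustration: the response shape, the `malformed_url` error code, and the example URLs are assumptions made for this example and do not reflect any particular ad server's API.

```python
def choose_ad_url(ad_response: dict, fallback_url: str) -> str:
    """Return the ad URL to play, swapping in a fallback on a malformed-URL error."""
    # Hypothetical error code; real ad servers define their own.
    if ad_response.get("error") == "malformed_url":
        return fallback_url
    return ad_response["url"]


fallback = "https://ads.example.com/fallback.mp4"

ok_response = {"url": "https://ads.example.com/spot-1.mp4"}
bad_response = {"error": "malformed_url", "url": "htp:/broken"}

print(choose_ad_url(ok_response, fallback))
print(choose_ad_url(bad_response, fallback))
```

Because the rule has exactly one trigger and one action, it can run automatically on every ad request without human review.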

Sometimes, the system which requires change is outside our realm of control. Thanks to advancements in software-defined networking, there are more opportunities than ever to share data back with other parts of the video delivery chain, like the CDN. The CDN can then make better changes to its own system. Maybe it changes egress pathways to avoid an overburdened node or finds a more direct peering path with an ISP. Either way, data in real-time creates an opportunity for feedback loops to be installed between critical services, which will improve the streaming experience.

With better data powering automation and promoting a real-time feedback loop between the distributor, network, and beyond, VidOps managers could transition from reacting to problems to proactively hedging against them.

How to identify the best systems and sources for raw data.

Where does one bent on implementing VidOps go to start collecting data? What technology is suited to the task of collecting all the data we need? Will that system normalize quickly and efficiently? What about sharing that data with analytics, CDNs, and automation systems?

Consider using Datazoom, which accomplishes each and every one of these tasks. With minimal coding effort, Datazoom’s Video Data platform can originate and capture data from players using SDKs and libraries, or connect to APIs and logging systems. It normalizes data according to our Data Dictionary and connects it to third-party systems and stakeholders with sub-second latency, helping to operationalize your video data.

If you need help bringing VidOps strategies to your approach to video, my friends here at Zype, as well as my colleagues at Datazoom, can help you in that journey. Let’s chart a course for the future of video, together!