
How Schema Evolution helps power 2 billion requests at Bazaarvoice
Bazaarvoice is a company that uses data to help businesses make smarter decisions, and in turn, help customers make better choices. Bazaarvoice's client base includes some of the biggest names in retail, including Walmart, Crate & Barrel, and Sephora.
We helped Bazaarvoice build an API that can handle up to 2 billion requests every month. This allows Bazaarvoice to quickly gather user data from all over the world for their clients, and then share that data with them in a format that is more useful than ever before.
This was a huge project for us because of the sheer amount of information we were working with. We had to create the flexibility and security necessary to accommodate the vast number of users who interact with Bazaarvoice on a daily basis. It was a challenge we were happy to tackle head-on.
A key component of this distributed system at scale was schema evolution. Data in a system like this needs to be serialized, and then deserialized, several times throughout its life. Schema changes across your consumers and producers can cause havoc when managing a system at this scale.
In any complex system, a schema change must be handled seamlessly. Schema evolution guarantees that a compatible change in one place will not break any other part of a system. Put another way, downstream consumers *do not need to be updated* to handle any compatible changes upstream. This is valuable because it allows us to perturb the graph without worry of breaking something downstream.
There are three main schema evolution patterns:
- Backwards - data encoded with an older schema can be read with a newer schema
- Forwards - data encoded with a newer schema can be read with an older schema
- Full - data encoded with an older or newer schema can be read by both older and newer schemas
Which pattern to choose?
The evolution pattern is determined by the relationship between the producer and its consumers. A handy way to tie the compatibility modes to the producer-consumer concept is the following:
- If the producer falls behind (‘backwards’) its consumers, meaning schema changes are made to consumers but not the producer, then we need *Backward* compatibility to ensure seamless operation.
- If the producer falls ahead (‘forwards’) of its consumers, meaning changes are made to the producer but not its consumers, then we need *Forward* compatibility to ensure seamless operation.
- If the *producer and its consumers* have no well-defined relationship, meaning changes are made to both, then we need *Full* compatibility.
To rephrase for emphasis: as long as we restrict ourselves to compatible changes, we can make any such change and guarantee our system continues to work seamlessly.
How does evolution provide such a strong guarantee?
Evolution provides such a strong guarantee by supplementing missing fields with default values. Let us examine these patterns a little more carefully with some examples:
Backward Evolution:
We may pick this setting for the following reasons:
- If the producer falls behind its consumers because changes are made to consumers
- Consumers are fluid and ever-changing, and producers do not change (slow to change)
- New consumers have to read historical data (replay a topic), some of which is written using old schemas
How to evolve in a backward manner (simplified):
- Adding new field: Must specify a default value.
- Removing any field: No restriction.
- Mutated name: Essentially two operations: a remove and then an add. Same restrictions apply.
- Mutated type: Not allowed except for Avro promotion int => long etc.
// Weather - Backward evolution
// Schema v1
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" }
] }
// Schema v2
// Add a field, humidity, with a default to maintain backward compatibility
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" },
  { "name": "humidity", "type": "long", "default": 0 }
] }
// Schema v3
// Add another field, wind, with a default value to maintain backward compatibility
// Remove the temperature field
{ "type": "record", "name": "Weather", "fields": [
  { "name": "humidity", "type": "long", "default": 0 },
  { "name": "wind", "type": "long", "default": 0 }
] }
Producer v1 is producing documents with only temperature while Consumer v3 expects wind and humidity. Since we have default values in place for both wind and humidity in v3, we are able to continue operation.
Consumers can be upgraded to any evolved schema higher than the producer's schema and still function without breaking. Producers can upgrade schemas, but only to a version at or below that of all their consumers. To illustrate from our example, Producer v1 cannot move to v2 until Consumer v1 moves to v2.
In more concrete terms, when the application code inside Consumer v3 tries to do something to the effect of record.getField(“wind”), it will return a value of 0L, which was the default specified in the schema. This trivial explanation will make more sense when we get to the Schema Registry and how it auto-resolves schemas.
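To make this concrete, here is a minimal sketch of that resolution using Avro's GenericDatumReader directly (the class names and values are illustrative, not Bazaarvoice's actual pipeline code): bytes written by Producer v1 are read by Consumer v3, and the fields v1 never wrote come back as their schema defaults.

// BackwardEvolutionDemo.java - illustrative only
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class BackwardEvolutionDemo {
    // Schema v1: what the producer writes (temperature only)
    static final Schema WRITER_V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Weather\",\"fields\":["
      + "{\"name\":\"temperature\",\"type\":\"long\"}]}");

    // Schema v3: what the consumer expects (humidity and wind, both defaulted)
    static final Schema READER_V3 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Weather\",\"fields\":["
      + "{\"name\":\"humidity\",\"type\":\"long\",\"default\":0},"
      + "{\"name\":\"wind\",\"type\":\"long\",\"default\":0}]}");

    public static void main(String[] args) throws Exception {
        // Producer v1 serializes a record that only contains temperature.
        GenericRecord v1Record = new GenericData.Record(WRITER_V1);
        v1Record.put("temperature", 72L);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER_V1).write(v1Record, encoder);
        encoder.flush();

        // Consumer v3 reads the same bytes; Avro resolves the writer schema (v1)
        // against the reader schema (v3) and supplies the declared defaults.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord v3View =
            new GenericDatumReader<GenericRecord>(WRITER_V1, READER_V3).read(null, decoder);

        System.out.println(v3View.get("humidity")); // 0 - default from the schema
        System.out.println(v3View.get("wind"));     // 0 - default from the schema
    }
}

Note that the reader is constructed with both the writer's schema (v1) and its own schema (v3); supplying the writer's schema automatically is exactly the job the Schema Registry takes on later.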
Forward Evolution:
We may pick this setting for the following reasons:
- If the producer falls ahead of its consumers because schema changes are made to producers
- Producers are fluid and ever-changing, and consumers do not change (slow to change)
- Unchanged (old) consumers need the ability to deserialize newly produced documents
How to evolve in a forward manner (simplified):
- Adding new field: No restriction, except the obvious (cannot be the same name as another field etc.)
- Removing any field: The field being deleted must have a default value.
- Mutated name: Essentially two operations: a remove and then an add. Same restrictions apply.
- Mutated type: Not allowed, except for Avro promotion in reverse (e.g., long => int)
// Weather - Forward evolution
// Schema v1
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" }
] }
// Schema v2
// Add field: humidity, with a default
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" },
  { "name": "humidity", "type": "long", "default": 0 }
] }
// Schema v3
// Add field: wind
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" },
  { "name": "humidity", "type": "long", "default": 0 },
  { "name": "wind", "type": "long" }
] }
// Schema v4
// Remove field: humidity
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" },
  { "name": "wind", "type": "long" }
] }
Producer v4 is producing documents with only temperature and wind while Consumer v3 expects temperature, wind and humidity. Since we have default values in place for humidity (0L), we are able to continue operation. Consumer v1 continues since all it needs is temperature.
Everything in this system will continue to work without a hiccup. Producers can be upgraded to any evolved schema that is higher than their downstream consumers' schemas. Consumers can upgrade schemas, but only to a version at or below that of their upstream producers. To illustrate from our example, Producer v4 cannot move back to v2 until Consumer v3 moves to v2.
In more concrete terms, when the application code inside Consumer v3 tries to do something to the effect of record.getField(“humidity”), it will return a value of 0L. Again, this seemingly trivial explanation will make more sense when we get to the Schema Registry and how it auto-resolves schemas.
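The forward direction can be sketched the same way, again purely as an illustration using plain Avro classes rather than the production code: bytes written with Schema v4 are decoded by a consumer still built against Schema v3, and humidity falls back to its default.

// ForwardEvolutionDemo.java - illustrative only
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ForwardEvolutionDemo {
    // Schema v4: what the producer now writes (humidity removed)
    static final Schema WRITER_V4 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Weather\",\"fields\":["
      + "{\"name\":\"temperature\",\"type\":\"long\"},"
      + "{\"name\":\"wind\",\"type\":\"long\"}]}");

    // Schema v3: what the older consumer still expects
    static final Schema READER_V3 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Weather\",\"fields\":["
      + "{\"name\":\"temperature\",\"type\":\"long\"},"
      + "{\"name\":\"humidity\",\"type\":\"long\",\"default\":0},"
      + "{\"name\":\"wind\",\"type\":\"long\"}]}");

    public static void main(String[] args) throws Exception {
        // Producer v4 no longer writes humidity.
        GenericRecord v4Record = new GenericData.Record(WRITER_V4);
        v4Record.put("temperature", 72L);
        v4Record.put("wind", 12L);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER_V4).write(v4Record, encoder);
        encoder.flush();

        // Consumer v3 still expects humidity; schema resolution falls back to its default.
        GenericRecord v3View = new GenericDatumReader<GenericRecord>(WRITER_V4, READER_V3)
            .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));

        System.out.println(v3View.get("humidity")); // 0 - default, since v4 stopped writing it
    }
}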
Full Evolution:
We may pick this setting for the following reasons:
- If the producer can fall behind its consumers, and consumers can fall behind the producer, because changes are made on both sides
- Consumers are fluid and ever-changing, and producers are also fluid and evolving
- We want the ability for old and new consumers to deserialize old and new data.
How to evolve (simplified):
- Adding new field: Must specify a default value.
- Removing any field: Field must have a default value.
- Mutated name: Essentially two operations: a remove and then an add. Same restrictions apply.
- Mutated type: Not allowed (not even Avro promotion); the only type change allowed is string <=> bytes
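The earlier Weather schemas can be evolved in a fully compatible way too. The following pair of versions is illustrative (it is not from the original project): because the added field carries a default, new readers can fill it in when reading old data, and old readers simply ignore it when reading new data.

// Weather - Full evolution
// Schema v1
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" }
] }
// Schema v2
// Add humidity with a default: v2 consumers can read v1 data (the default fills the gap),
// and v1 consumers can read v2 data (the extra field is ignored)
{ "type": "record", "name": "Weather", "fields": [
  { "name": "temperature", "type": "long" },
  { "name": "humidity", "type": "long", "default": 0 }
] }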
Producers can be upgraded to any evolved schema on the schema timeline. Consumers can also be upgraded to any schema on the timeline.
Summary:
The important takeaway here is that the value in evolution is only realized when we have a mismatch in producer and consumer schema versions. This is by design. If we were to update all the affected producers/consumers every time we made a schema change, then we wouldn’t really need evolution, nor would we be utilizing its value. Or, put in a different way, if both producer and consumer schema were always forced to be the same, we wouldn’t really have evolution.
The backbone of evolution is the ability to replace a missing field with a default value. So far we’ve assumed that the priority of our system is to continue operating even if fields are missing. What if this is not what we want? Worse yet, what if, by ‘papering over’ a missing field with a default value, we run the risk of incorrect computations and polluted streams?
An important question to ask is: Where, in the application, do we actually need default values?