Data (de)serialization is about converting between raw data and instances of your Python classes, in order to transmit or store that data. A typical example is the conversion between a Python object and its JSON string representation. This article discusses the two stand-alone frameworks marshmallow and Pydantic, which handle both the conversion and data validation. I introduce each framework with a small example, compare marshmallow and Pydantic and highlight their differences, and discuss a few caveats you should be aware of with both libraries.
Introduction
When communicating with other systems or persisting data, you need to serialize Python objects to a raw data format that can be transmitted or stored, and deserialize such raw data, converting it back to Python objects. Solutions built into Python, such as pickle or json, are limited. Pickle only works if serialization and deserialization are both under your control, and you must never change the package or module name of the serialized classes (otherwise unpickling breaks). Python's json module only converts between strings and Python dicts, but does not convert those to objects of your own classes. Since working with instances of your own classes is much cleaner than working with dicts, there is a need for this conversion. This is especially true if you work with external systems that are not under your control, such as a REST service, or when parsing JSON, XML, or YAML files provided by others. This article introduces the libraries marshmallow and Pydantic, which let you perform these steps with as little effort as possible. Several other libraries do similar things, but they are either less used or maintained at the time of writing, or belong to a larger framework (e.g. Django REST Framework), which is an over-sized solution when you only need data serialization.
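To see this limitation in practice, the following snippet shows that Python's json module stops at the dict level and never produces domain objects:

```python
import json

raw = '{"email": "foo@bar.com", "books": [{"title": "Clean Coder"}]}'
data = json.loads(raw)

# json.loads only yields nested dicts/lists of primitive types ...
print(type(data))  # <class 'dict'>

# ... so you get key access, but no attribute access and no validation:
print(data["email"])  # works, but data.email would raise AttributeError
```

Converting this dict into a proper object of one of your own classes is exactly the gap that marshmallow and Pydantic fill.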
Conversion, data validation and schemas
Converting raw data to Python objects actually involves two steps, as the following figure illustrates:
First, data is converted from whatever raw form (binary or text) to a nested Python dict, which only contains primitive data types, such as str, float, int or bool (and nested dicts and lists thereof). The resulting dict is given to marshmallow or Pydantic, which validates the data. The basic assumption is that the incoming data comes from an untrusted source, making validation necessary. While Pydantic returns a Python object right away, marshmallow returns a cleaned, validated dict. With marshmallow, the conversion from that cleaned dict to an instance of a complex Python class (e.g. one of your custom-made classes) is an optional step.
To make validation work, you need to define a schema. To understand schemas better, let’s take a look at an example:
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Book:
    title: str
    isbn: str

@dataclass
class User:
    email: str
    books: Optional[List[Book]] = None
These simple dataclasses do contain the correct types (for instance, str for title and isbn), but a lot of information is missing. A book's title may be limited to 120 characters in your application. Also, not every possible string is a valid ISBN number, or a valid email address. Consequently, you need to make this implicit domain knowledge explicit in the field definitions in some way. Let's take a look at how to do this with marshmallow and Pydantic.
Introduction to marshmallow deserialization
After installing marshmallow with pip, let's write our first schema definition:
from marshmallow import Schema, fields, validate

def validate_isbn(isbn: str) -> None:
    """Validates the isbn with some code (omitted); raises marshmallow.ValidationError if validation did not pass."""

class BookSchema(Schema):
    title = fields.String(required=True, validate=validate.Length(max=120))
    isbn = fields.String(required=True, validate=validate_isbn)

class UserSchema(Schema):
    email = fields.Email(required=True)
    books = fields.Nested(BookSchema, many=True, required=False)
As you can see, we have to create two new classes, BookSchema and UserSchema, which define the fields, their types and additional validators. Marshmallow ships with a large number of ready-to-use validators (e.g. Email or Length), and in this example we have defined our own ISBN validator as a simple function (implementation omitted for brevity).
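As a hypothetical sketch of what such an omitted validator might do, the following function checks the ISBN-10 checksum. It raises a plain ValueError here to stay library-independent; a real marshmallow validator would raise marshmallow.ValidationError instead:

```python
def validate_isbn(isbn: str) -> None:
    """Check the ISBN-10 checksum; a real marshmallow validator would
    raise marshmallow.ValidationError instead of ValueError."""
    digits = isbn.replace("-", "")
    if len(digits) != 10:
        raise ValueError("ISBN-10 must have exactly 10 characters")
    # Weighted sum: first digit * 10, second * 9, ..., last * 1.
    # The final character may be 'X', which stands for the value 10.
    total = 0
    for weight, char in zip(range(10, 0, -1), digits):
        value = 10 if char == "X" else int(char)
        total += weight * value
    if total % 11 != 0:
        raise ValueError("invalid ISBN-10 checksum")

validate_isbn("0-306-40615-2")  # a valid ISBN-10, passes silently
```

A production validator would also cover ISBN-13, which uses a different checksum scheme.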
Let’s take a look at the deserialization process:
import json

raw_data_str = """{
    "email": "foo@bar.com",
    "books": [
        {"title": "Hitchhikers Guide to the Galaxy", "isbn": "123456789"},
        {"title": "Clean Coder", "isbn": "987654321"}
    ]
}"""

json_data = json.loads(raw_data_str)
schema = UserSchema()
user = schema.load(json_data)  # raises if validation fails
The user object looks exactly like json_data, but it is the cleaned, validated version of it, containing only those fields UserSchema knows about. To make marshmallow generate objects of our Book and User classes, we have to add a post_load hook method to each schema class, which looks as follows (shown only for UserSchema):
from marshmallow import post_load

class UserSchema(Schema):
    # ... fields as defined above ...

    @post_load
    def make_obj(self, data, **kwargs):  # note: the method name is arbitrary
        return User(**data)
With this modification, user would be a User object whose books attribute is a list of Book objects (assuming that you have also implemented the hook for BookSchema).
While this approach provides a nice separation between the class definitions and the schema, it involves a lot of code. Fortunately, there is the Python package marshmallow-dataclass. Using it, you can achieve the same result by augmenting the User and Book dataclasses we introduced above, as follows:
import json
from dataclasses import dataclass, field
from typing import Optional, List

import marshmallow_dataclass
from marshmallow import validate
from marshmallow.fields import Email

@dataclass
class Book:
    title: str = field(metadata=dict(required=True, validate=validate.Length(max=120)))
    isbn: str = field(metadata=dict(required=True, validate=validate_isbn))

@dataclass
class User:
    email: str = field(metadata={"marshmallow_field": Email()})
    books: Optional[List[Book]] = None

UserSchema = marshmallow_dataclass.class_schema(User)  # UserSchema is a class(!) object
user_schema_instance = UserSchema()  # which we have to instantiate
json_data = json.loads(raw_data_str)  # raw_data_str was defined above
user = user_schema_instance.load(json_data)
The resulting user object is of type User, because the marshmallow-dataclass library implements the post_load hook for us.
Introduction to Pydantic deserialization
Pydantic is not a direct competitor to marshmallow, because Pydantic's goal is to add validation to (schema) objects throughout their lifetime. Pydantic basically applies dynamic type checking at run-time, e.g. when instantiating an object, at configurable levels. In contrast, marshmallow only applies type checking (including data validation) at specific points: whenever you call schema.load(), schema.dump() or schema.validate(). Let's see how the same example from above looks with Pydantic (make sure to run pip install pydantic first):
import json
from typing import List, Optional

from pydantic import BaseModel, constr, EmailStr, validator

class MyBaseModel(BaseModel):
    """A base model from which all our other models inherit. Provides basic configuration parameters."""
    class Config:
        validate_assignment = True

class BookModel(MyBaseModel):
    title: constr(max_length=120)  # constr = constrained string, other con... types exist
    isbn: str

    @validator('isbn')
    def validate_isbn(cls, isbn):
        """Validates the isbn with some code (omitted) and returns it; raises ValueError if validation did not pass."""
        return isbn

class UserModel(MyBaseModel):
    email: EmailStr
    books: Optional[List[BookModel]] = None

raw_data_str = ...  # see above examples
json_data = json.loads(raw_data_str)
user = UserModel(**json_data)
user.email = "not-an-email"  # raises pydantic.error_wrappers.ValidationError
Like marshmallow, Pydantic supports numerous data types beyond Python's standard built-in types, such as email addresses, URLs, or payment card numbers. Be sure to closely study Pydantic's manual, because there are some interesting caveats. For instance, Pydantic is actually not that pedantic when it comes to type matching. The call BookModel(title=1337, isbn=1234) works (no validation error is raised), even though 1337 is a number and not a string. Pydantic converts provided data wherever no (or little) loss would occur. To avoid this behavior, you have to annotate the attributes using strict types, e.g. pydantic.StrictStr instead of str, as documented here.
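A quick sketch of the strict variant (assuming Pydantic is installed; the StrictBook model name is illustrative):

```python
from pydantic import BaseModel, StrictStr, ValidationError

class StrictBook(BaseModel):
    title: StrictStr  # only accepts actual str instances, no coercion
    isbn: StrictStr

rejected = False
try:
    StrictBook(title=1337, isbn=1234)  # ints are no longer coerced to str
except ValidationError:
    rejected = True
print(rejected)  # True

book = StrictBook(title="1337", isbn="1234")  # real strings still work
```

With plain str annotations, the first call would have silently produced string attributes instead of failing.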
Pydantic also lets you define your classes using a drop-in replacement for dataclasses, as the following example illustrates:
from typing import List, Optional

from pydantic import constr, EmailStr, validator
from pydantic.dataclasses import dataclass

class Config:
    validate_assignment = True

@dataclass(config=Config)
class Book:
    title: constr(max_length=120)
    isbn: str

    @validator('isbn')
    def validate_isbn(cls, isbn):
        """Validates the isbn with some code (omitted) and returns it; raises ValueError if validation did not pass."""
        return isbn

@dataclass(config=Config)
class User:
    email: EmailStr
    books: Optional[List[Book]] = None
The deserialization process then produces objects of that class, as you would expect.
Serialization
Until now, the code examples demonstrated the deserialization use-case. If you want to serialize data, e.g. to transmit or store it, you just need to call a specific method provided by marshmallow or Pydantic to get a nested dict, which you can then serialize to whatever form you want, e.g. to a JSON string or to some binary form, using pickle, MessagePack, etc. With marshmallow, you call a schema's dump(obj) method. Note that this call validates your data again, which may raise unexpected errors for objects with actually invalid data that you created manually (using their normal constructor rather than schema.load()). With Pydantic, you call the o.dict() method on a model object o that inherits from pydantic.BaseModel to get a nested dict. Note that this dict might still contain non-primitive types, such as datetime objects, which many converters (including Python's json module) cannot handle. Consider using o.json() instead.
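The datetime problem is easy to reproduce with the standard json module alone; the following snippet also shows a common workaround using a fallback serializer:

```python
import json
from datetime import datetime

data = {"title": "Clean Coder", "published": datetime(2011, 5, 13)}

failed = False
try:
    json.dumps(data)  # datetime is not JSON-serializable by default
except TypeError:
    failed = True
print(failed)  # True

# Workaround: convert unknown types to strings via the 'default' hook
serialized = json.dumps(data, default=str)
print(serialized)  # {"title": "Clean Coder", "published": "2011-05-13 00:00:00"}
```

Pydantic's o.json() handles such conversions for you, applying sensible encodings per type (e.g. ISO 8601 for datetimes).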
Marshmallow vs. Pydantic – which one is better?
The choice comes down to a matter of personal preference and needs. If all you need is serialization and deserialization at specific points in your application, I recommend marshmallow. Pydantic is a good choice if you want type safety throughout the whole lifetime of your objects at run-time, better interoperability with standards, or require very good run-time performance.
Data serialization caveats
Before I conclude, I’d like to talk about two caveats I faced while using these libraries.
Variable naming
When you receive raw, key-value-like data, e.g. in JSON form, the keys may follow all kinds of naming conventions. For instance, you may receive {"user-name": "Peter"} (kebab case), {"userName": "Peter"} (camel case), or {"user_name": "Peter"} (snake case). In Python, however, snake case is the convention, and using any other one would be extremely weird.
Fortunately, both marshmallow and Pydantic let you rename fields dynamically. See here for marshmallow and here for Pydantic. In the case of Pydantic, the linked docs only cover deserialization; for serialization you simply call o.dict(by_alias=True) instead of o.dict().
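To illustrate what these renaming mechanisms automate, here is a hypothetical helper that converts camel-case keys to snake case before handing the dict to a schema (the function name and regex are my own, not part of either library):

```python
import re

def camel_to_snake(key: str) -> str:
    """Convert a camelCase key such as 'userName' to snake_case ('user_name')."""
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", key).lower()

raw = {"userName": "Peter", "favoriteBooks": []}
converted = {camel_to_snake(k): v for k, v in raw.items()}
print(converted)  # {'user_name': 'Peter', 'favorite_books': []}
```

The built-in alias mechanisms are preferable in practice, since they also handle nested objects and the reverse direction during serialization.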
Data ambiguity and polymorphism
Data processing with both libraries works as expected, as long as the set of classes is deterministic. In particular, deserialization requires prior knowledge (provided by your code) about which schema to use to deserialize a nested dict object you just obtained from somewhere. Let's take a look at an example which is not deterministic, but ambiguous, because it involves inheritance:
from dataclasses import dataclass
from typing import List

@dataclass
class Animal:
    name: str

@dataclass
class Tiger(Animal):
    weight_lbs: int

@dataclass
class Elephant(Animal):
    trunk_length_cm: int

@dataclass
class Zoo:
    animals: List[Animal]

animals = [Elephant(name="Eldor", trunk_length_cm=176), Tiger(name="Roy", weight_lbs=405)]
zoo = Zoo(animals=animals)
Animal is inherited by Tiger and Elephant. However, serialization considers Zoo.animals to contain just Animal objects. Consequently, serializing zoo would result in something like {'animals': [{'name': 'Eldor'}, {'name': 'Roy'}]}. The information added by the sub-classes is lost.
Neither Pydantic nor marshmallow supports this kind of data out of the box. For marshmallow, there are a few helper projects, but I found them tedious to use. A better workaround is to add an attribute to all of your ambiguous classes, e.g. datatype, that contains a distinct string literal for each class. You also have to reduce the ambiguity by explicitly stating a typing.Union field that lists all possibly expected classes, as illustrated below for marshmallow-dataclass:
from dataclasses import dataclass, field
from typing import List, Union

from marshmallow.fields import Constant

@dataclass
class Tiger(Animal):
    weight_lbs: int
    datatype: str = field(metadata=dict(marshmallow_field=Constant("tiger")), default="tiger")

@dataclass
class Elephant(Animal):
    trunk_length_cm: int
    datatype: str = field(metadata=dict(marshmallow_field=Constant("elephant")), default="elephant")

@dataclass
class Zoo:
    animals: List[Union[Tiger, Elephant]]  # formerly List[Animal]
However, while this approach makes data (de-)serialization of ambiguous data possible, it taints the schema with aspects that are specific to (de-)serialization.
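The same discriminator idea can be sketched in plain Python, without either library: the datatype field decides which concrete class to instantiate (the registry and function names below are illustrative, not part of marshmallow or Pydantic):

```python
from dataclasses import dataclass

@dataclass
class Tiger:
    name: str
    weight_lbs: int

@dataclass
class Elephant:
    name: str
    trunk_length_cm: int

# Map each discriminator value to its concrete class
DATATYPE_REGISTRY = {"tiger": Tiger, "elephant": Elephant}

def deserialize_animal(data: dict):
    """Pick the concrete class based on the 'datatype' discriminator."""
    cls = DATATYPE_REGISTRY[data.pop("datatype")]
    return cls(**data)

animal = deserialize_animal({"datatype": "tiger", "name": "Roy", "weight_lbs": 405})
print(animal)  # Tiger(name='Roy', weight_lbs=405)
```

This is essentially what the typing.Union approach above does for you, with validation on top.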
Conclusion
Marshmallow and Pydantic are both very powerful frameworks. I suggest that you experiment with both of them, to get a handle on how they feel, how well they can be integrated into your existing code base, and their run-time performance. You need to decide early on whether you want to put the class definition and the schema definition into a single class (using a dataclass), or keep them separated. The latter is semantically cleaner, but involves writing more code.