marshmallow vs. pydantic – Python’s 2 best libraries for data serialization and validation

Data (de-)serialization is about converting between raw data items and instances of your Python classes, in order to transmit or store data. A typical example is the conversion between a Python object and its JSON string representation. This article discusses the two stand-alone frameworks marshmallow and Pydantic, which handle the conversion as well as data validation. I provide an introduction to each framework using a small example, compare marshmallow and Pydantic and highlight their differences, and discuss a few caveats you should be aware of with both libraries.

Introduction

When communicating with other systems or persisting data, you need to serialize Python objects to a raw data format that can be transmitted or stored, and deserialize such raw data, converting it to Python objects again. Solutions built into Python, such as pickle or json, are limited. Pickle only works if both serialization and deserialization are under your control, and if you never change the package or module name of the serialized classes (otherwise unpickling would break). Python’s json module only converts between strings and Python dicts, but does not convert those to objects of your own classes. Since working with instances of your own classes is much cleaner than working with dicts, there is a need for this conversion. This is especially true if you work with external systems, such as a REST service, or parse JSON, XML, or YAML files provided by others, which are not under your control. This article introduces the libraries marshmallow and Pydantic, which let you perform these steps with as little effort as possible. There are several other libraries that do similar things, but they are either less widely used or less actively maintained at the time of writing, or belong to a larger framework (e.g. Django Rest Framework), which is an over-sized solution when considering only data serialization.
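To make the limitation of the json module concrete, here is a minimal sketch that uses only the standard library. The `Point` class is a hypothetical example class introduced just for this illustration.

```python
import json


class Point:
    """Hypothetical example class, used only for this illustration."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


# Deserialization: json.loads() yields a plain dict, not a Point.
parsed = json.loads('{"x": 1, "y": 2}')
print(type(parsed).__name__)  # dict

# Serialization: json.dumps() does not know how to handle a Point.
try:
    json.dumps(Point(1, 2))
    dump_failed = False
except TypeError:
    dump_failed = True
print(dump_failed)  # True
```

Both directions need extra code that maps between dicts and your classes, which is the gap the libraries discussed in this article fill.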

Conversion, data validation and schemas

Converting raw data to Python objects actually involves two steps, as the following figure illustrates:

First, data is converted from whatever raw form (binary or text) to a nested Python dict, which only contains primitive data types, such as str, float, int or bool (and nested dicts and lists thereof). The resulting dict is given to marshmallow or Pydantic, which validate the data. The basic assumption is that the incoming data comes from an untrusted source, making validation necessary. While Pydantic returns a Python object right away, marshmallow returns a cleaned, validated dict. With marshmallow, the conversion from that cleaned dict to an instance of a complex Python class (e.g. one of your custom-made classes) is an optional step.

To make validation work, you need to define a schema. To understand schemas better, let’s take a look at an example:


```python
from dataclasses import dataclass
from typing import Optional, List


@dataclass
class Book:
    title: str
    isbn: str


@dataclass
class User:
    email: str
    books: Optional[List[Book]] = None
```

These simple dataclasses do contain the correct types (for instance: str for title and isbn), but a lot of information is missing. A book’s title may be limited to 120 characters in your application. Also, not every possible string is a valid ISBN, or a valid email address. Consequently, you need to explicitly specify your implicit domain knowledge in the fields in some way. Let’s take a look at how to do this with marshmallow and Pydantic.
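The following sketch shows why the plain dataclasses are not enough on their own. It repeats the two classes from above so that it runs stand-alone; the invalid values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Book:
    title: str
    isbn: str


@dataclass
class User:
    email: str
    books: Optional[List[Book]] = None


# Type hints are not enforced at run-time: all of this is accepted silently.
bad_book = Book(title=42, isbn=None)
bad_user = User(email="definitely not an email address")

# Nested data is not converted either: the books stay plain dicts.
data = {"email": "foo@bar.com",
        "books": [{"title": "Clean Coder", "isbn": "987654321"}]}
user = User(**data)
print(type(user.books[0]).__name__)  # dict
```

Neither the types nor the domain rules are checked, and nested objects are not built, so all of that has to come from a schema.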

Introduction to marshmallow deserialization

After having installed marshmallow with pip, let’s write our first schema definition:

```python
from marshmallow import Schema, fields, validate


def validate_isbn(isbn: str) -> None:
    """
    Validates the isbn with some code (omitted),
    raise marshmallow.ValidationError if validation did not pass
    """


class BookSchema(Schema):
    title = fields.String(required=True, validate=validate.Length(max=120))
    isbn = fields.String(required=True, validate=validate_isbn)


class UserSchema(Schema):
    email = fields.Email(required=True)
    books = fields.Nested(BookSchema, many=True, required=False)
```

As you can see, we have to create two new classes, BookSchema and UserSchema, which define the fields, their types and additional validators. Marshmallow ships with a large number of ready-to-use validators (e.g. Email or Length), and in this example we have defined our own ISBN-validator as a simple function (implementation omitted for brevity).
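Since the body of validate_isbn is omitted above, here is one possible sketch of it, assuming you want to check the ISBN-10 and ISBN-13 check digits. The helper name `is_valid_isbn` is hypothetical. To keep the block free of third-party imports it raises a plain ValueError; inside a marshmallow schema you would raise marshmallow.ValidationError instead, as the docstring above says.

```python
DIGITS = "0123456789"


def is_valid_isbn(isbn: str) -> bool:
    """Return True if isbn has a correct ISBN-10 or ISBN-13 check digit."""
    chars = isbn.replace("-", "").replace(" ", "")
    if len(chars) == 10:
        body, check = chars[:9], chars[9].upper()
        if any(c not in DIGITS for c in body) or check not in DIGITS + "X":
            return False
        # The last character may be "X", which stands for the value 10.
        values = [int(c) for c in body] + [10 if check == "X" else int(check)]
        # Weights 10, 9, ..., 1; the weighted sum must be divisible by 11.
        return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0
    if len(chars) == 13:
        if any(c not in DIGITS for c in chars):
            return False
        # Weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10.
        return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars)) % 10 == 0
    return False


def validate_isbn(isbn: str) -> None:
    """Hypothetical validator body; swap ValueError for
    marshmallow.ValidationError when using it in a schema."""
    if not is_valid_isbn(isbn):
        raise ValueError(f"Not a valid ISBN: {isbn!r}")
```

Note that the placeholder ISBNs used in this article's sample data (such as "123456789") would be rejected by a real check like this one.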

Let’s take a look at the deserialization process:

```python
import json

raw_data_str = """{
    "email": "foo@bar.com",
    "books": [
        {"title": "Hitchhikers Guide to the Galaxy", "isbn": "123456789"},
        {"title": "Clean Coder", "isbn": "987654321"}
    ]
}"""
json_data = json.loads(raw_data_str)
schema = UserSchema()
user = schema.load(json_data)  # raises if validation fails
```

The user object looks exactly like json_data, but it is the cleaned, validated version of it, containing only those fields UserSchema knows about. To make marshmallow generate objects of our Book and User class, we have to add a post_load hook method to each schema class, which looks as follows (just shown for UserSchema):

```python
from marshmallow import post_load


class UserSchema(Schema):
    # .... for other stuff see above ....

    @post_load
    def make_obj(self, data, **kwargs):  # note: method name is arbitrary
        return User(**data)
```

With this modification, user would be a User object whose books attribute is a list of Book objects (assuming that you have also implemented the hook for BookSchema).

While this approach provides a nice separation between the class definitions and the schema, it involves a lot of code. Fortunately, there is the Python package marshmallow-dataclass. Using it, you can achieve the same result by augmenting the User and Book dataclasses we introduced above, as follows:

```python
import json
from dataclasses import dataclass, field
from typing import Optional, List

import marshmallow_dataclass
from marshmallow import validate
from marshmallow.fields import Email


@dataclass
class Book:
    title: str = field(metadata=dict(required=True, validate=validate.Length(max=120)))
    isbn: str = field(metadata=dict(required=True, validate=validate_isbn))


@dataclass
class User:
    email: str = field(metadata={"marshmallow_field": Email()})
    books: Optional[List[Book]] = None


UserSchema = marshmallow_dataclass.class_schema(User)  # UserSchema is a class(!) object
user_schema_instance = UserSchema()  # which we have to instantiate
json_data = json.loads(raw_data_str)  # raw_data_str was defined above
user = user_schema_instance.load(json_data)
```

The resulting user object is of type User, because the marshmallow-dataclass library implements the post_load hook for us.

Introduction to Pydantic deserialization

Pydantic is not a direct competitor to marshmallow, because the goal of Pydantic is to add validation to (schema-)objects throughout their lifetime. Pydantic basically applies dynamic type checking at run-time, e.g. when instantiating an object, at configurable levels. In contrast, marshmallow only applies type checking (including data validation) at specific points, whenever you call schema.load() or schema.validate(). Note that in marshmallow 3, schema.dump() serializes without validating. Let’s see how the same example from above looks with Pydantic (make sure to run pip install pydantic first):

```python
import json
from typing import List

from pydantic import BaseModel, constr, EmailStr, validator


class MyBaseModel(BaseModel):
    """
    A base model from which all our other models inherit.
    Provides basic configuration parameters.
    """

    class Config:
        validate_assignment = True


class BookModel(MyBaseModel):
    title: constr(max_length=120)  # constr = constrained string, other con... types exist
    isbn: str

    @validator('isbn')
    def validate_isbn(cls, isbn):
        """
        Validates the isbn with some code (omitted) and return it,
        raise ValueError if validation did not pass
        """
        return isbn


class UserModel(MyBaseModel):
    email: EmailStr
    books: List[BookModel] = None


raw_data_str = ...  # see above examples
json_data = json.loads(raw_data_str)
user = UserModel(**json_data)
user.email = "not-an-email"  # raises pydantic.error_wrappers.ValidationError
```

Like marshmallow, Pydantic supports numerous data types beyond Python’s standard built-in types, such as email, URL, or payment card numbers. Be sure to closely study Pydantic’s manual, because there are some interesting caveats. For instance, Pydantic is actually not that pedantic when it comes to type matching. The call BookModel(title=1337, isbn=1234) works (no validation error is raised), even though 1337 is a number and not a string. Pydantic converts provided data where no (or little) loss would occur. To avoid this behavior, you have to annotate the attributes using strict types, e.g. pydantic.StrictStr instead of str, as documented here.
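The difference between lenient and strict type matching can be illustrated in plain Python. This is only a conceptual sketch with hypothetical helper names (`lenient_str`, `strict_str`); it is not how Pydantic is implemented and does not reproduce its exact conversion rules.

```python
def lenient_str(value: object) -> str:
    """Accept strings as-is and convert plain numbers via str()."""
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)):
        return str(value)
    raise TypeError(f"str expected, got {type(value).__name__}")


def strict_str(value: object) -> str:
    """Accept only real str instances, similar in spirit to StrictStr."""
    if not isinstance(value, str):
        raise TypeError(f"str expected, got {type(value).__name__}")
    return value


print(lenient_str(1337))  # 1337 (now a str, no error)

try:
    strict_str(1337)
    strict_rejected = False
except TypeError:
    strict_rejected = True
print(strict_rejected)  # True
```

The lenient variant is convenient for messy input, while the strict variant surfaces type mistakes early. Which one you want depends on how much you trust the data source.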

Pydantic also lets you define your classes using a drop-in replacement for dataclasses, as the following example illustrates:

```python
from typing import List

from pydantic import constr, EmailStr, validator
from pydantic.dataclasses import dataclass


class Config:
    validate_assignment = True


@dataclass(config=Config)
class Book:
    title: constr(max_length=120)
    isbn: str

    @validator('isbn')
    def validate_isbn(cls, isbn):
        """
        Validates the isbn with some code (omitted) and return it,
        raise ValueError if validation did not pass
        """
        return isbn


@dataclass(config=Config)
class User:
    email: EmailStr
    books: List[Book] = None
```

The deserialization process then produces objects of that class, as you would expect.
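The validate_assignment = True setting used in the Pydantic examples means that validation also runs whenever an attribute is re-assigned. The idea can be sketched in plain Python as follows. The class `ValidatedEmail` and its deliberately naive "@" check are hypothetical and only illustrate the concept; Pydantic's real email validation is more thorough.

```python
class ValidatedEmail:
    """Hypothetical class that re-validates `email` on every assignment."""

    def __init__(self, email: str) -> None:
        self.email = email  # goes through __setattr__ below

    def __setattr__(self, name: str, value: object) -> None:
        # Naive check for illustration only: a str containing "@".
        if name == "email" and (not isinstance(value, str) or "@" not in value):
            raise ValueError(f"invalid email: {value!r}")
        super().__setattr__(name, value)


user = ValidatedEmail("foo@bar.com")
try:
    user.email = "not-an-email"
    rejected = False
except ValueError:
    rejected = True

print(rejected)    # True
print(user.email)  # foo@bar.com (the invalid value was never stored)
```

With marshmallow, by contrast, such an assignment on a loaded object would go unnoticed until you explicitly validate again.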

Serialization

Until now, the code examples demonstrated the deserialization use-case. If you want to serialize data, e.g. to transmit or store it, you just need to call a specific method provided by marshmallow or Pydantic to get a nested dict, which you can then serialize to whatever form you want, e.g. to a JSON string or to some binary form, using pickle, MessagePack, etc. If you use marshmallow, you need to call a schema’s dump(obj) method. Note that in marshmallow 3 this call does not validate your data again (validation only happens in load() and validate()), so objects you created manually (using their normal constructor, not using schema.load()) with actually invalid data are serialized without a validation error. With Pydantic you call the o.dict() method on a model object o which inherits from pydantic.BaseModel, to get a nested dict. Note that this dict might still contain non-primitive types, such as datetime objects, which many converters (including Python’s json module) cannot handle. Consider using o.json() instead.
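The datetime problem mentioned above is easy to reproduce with the standard library alone. In this sketch, dataclasses.asdict() stands in for the "get a nested dict" step, and the `Event` class and the `to_primitive` helper are hypothetical names chosen for the illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class Event:
    """Hypothetical example class with a non-primitive field."""
    name: str
    starts_at: datetime


event = Event(name="launch", starts_at=datetime(2021, 5, 1, 12, 30))
as_dict = asdict(event)  # nested dict, but starts_at is still a datetime

try:
    json.dumps(as_dict)
    failed = False
except TypeError:
    failed = True  # datetime is not JSON serializable out of the box


def to_primitive(value: object) -> str:
    """Fallback for json.dumps: convert datetimes to ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


encoded = json.dumps(as_dict, default=to_primitive)
print(encoded)  # {"name": "launch", "starts_at": "2021-05-01T12:30:00"}
```

A `default` hook like this is one way to handle the remaining non-primitive values if you serialize the dict yourself.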

Marshmallow vs. Pydantic – which one is better?

The choice comes down to a matter of personal preferences and needs. Let’s take a look at a few categories:

  • Popularity / stability: it’s a bad idea to choose a library which is not very popular and thus has a high risk of being abandoned. Both marshmallow and Pydantic are about equally popular, with ~5k stars on GitHub each. However, marshmallow development seems to be more on top of things, with ~90 vs. 235 open issues.
  • Performance: the authors of Pydantic built benchmarks which claim that Pydantic’s performance is superior to marshmallow’s. This advantage is only meaningful if you do a lot of data (de)serialization, in which case I recommend you measure the performance impact for one of your more complex schemas. Don’t ever blindly trust benchmarks made by others. In case you work with JSON input/output, you can also use ujson to speed up the conversion between raw str data and nested Python dicts, for both marshmallow and Pydantic.
  • Interoperability with standards: Pydantic comes with built-in support for generating OpenAPI or JSON schema definitions from your model code. There is also an officially endorsed generator tool which converts existing OpenAPI / JSON schema definitions to pydantic model classes. Marshmallow does not offer these capabilities. However, there are third-party projects like marshmallow-jsonschema and apispec which convert marshmallow schema classes to JSON schema or OpenAPI respectively – but not the other way around!
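For the performance point above, a small timeit harness is usually enough to get your own numbers. The sketch below is a starting point under stated assumptions: the payload is made up, and `deserialize` is a placeholder that only calls json.loads(). Replace its body with your real schema.load() call or model constructor.

```python
import json
import timeit

# Hypothetical payload; replace it with data shaped like your real schema.
raw_data_str = json.dumps({
    "email": "foo@bar.com",
    "books": [{"title": f"Book {i}", "isbn": str(i)} for i in range(100)],
})


def deserialize(raw: str) -> dict:
    """Placeholder for the code under test, e.g. json.loads() followed by
    schema.load() (marshmallow) or a model constructor (Pydantic)."""
    return json.loads(raw)


runs = 1_000
seconds = timeit.timeit(lambda: deserialize(raw_data_str), number=runs)
print(f"{seconds / runs * 1e6:.1f} microseconds per call")
```

Run the same harness once per library with an identical payload, and repeat the measurement a few times, since single runs can be noisy.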

If all you need is serialization and deserialization at specific points in your application, I recommend marshmallow. Pydantic is a good choice if you want type safety throughout the whole lifetime of your objects at run-time, better interoperability with standards, or require very good run-time performance.

Data serialization caveats

Before I conclude, I’d like to talk about two caveats I faced while using these libraries.

Variable naming

When you receive raw, key-value-like data, e.g. in JSON form, the keys may follow all kinds of naming conventions. For instance, you may receive {"user-name": "Peter"} (kebab case), {"userName": "Peter"} (camel case), or {"user_name": "Peter"} (snake case). In Python, however, snake case is common, and using any other convention would be extremely weird.

Fortunately, both marshmallow and Pydantic offer support to actually rename fields dynamically. See here for marshmallow and here for Pydantic. In case of Pydantic, the linked docs just cover deserialization – for serialization you simply need to call o.dict(by_alias=True) instead of o.dict().
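To show what such a renaming amounts to, here is a manual, library-free sketch that normalizes the three key styles from above to snake case. The function names are hypothetical. With the libraries you would normally declare the mapping per field instead (marshmallow's data_key argument, or an alias in Pydantic) rather than rewrite keys by hand.

```python
import re


def to_snake_case(key: str) -> str:
    """Convert kebab-case or camelCase keys to snake_case."""
    key = key.replace("-", "_")
    # Insert an underscore before each upper-case letter that follows a
    # lower-case letter or digit, then lower-case everything.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def normalize_keys(data: dict) -> dict:
    """Rename the top-level keys of a dict (nested dicts are left untouched)."""
    return {to_snake_case(k): v for k, v in data.items()}


for raw in ({"user-name": "Peter"}, {"userName": "Peter"}, {"user_name": "Peter"}):
    print(normalize_keys(raw))  # {'user_name': 'Peter'} each time
```

A generic converter like this is lossy in the other direction, which is one reason explicit per-field names are the safer choice for serialization.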

Data ambiguity and polymorphism

Data processing with both libraries works as expected, as long as the set of classes is deterministic. In particular, deserialization requires prior knowledge (provided by your code) about which schema to use to deserialize a nested dict object you just obtained from somewhere. Let’s take a look at an example which is not deterministic, but ambiguous, because it involves inheritance:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Animal:
    name: str


@dataclass
class Tiger(Animal):
    weight_lbs: int


@dataclass
class Elephant(Animal):
    trunk_length_cm: int


@dataclass
class Zoo:
    animals: List[Animal]


animals = [Elephant(name="Eldor", trunk_length_cm=176), Tiger(name="Roy", weight_lbs=405)]
zoo = Zoo(animals=animals)
```

Animal is inherited by Tiger and Elephant. However, serialization considers Zoo.animals to be just Animal objects. Consequently, serializing zoo would result in something like {'animals': [{'name': 'Eldor'}, {'name': 'Roy'}]}. The information added by the sub-classes is missing.

Neither Pydantic nor marshmallow supports this kind of data out of the box. For marshmallow, there are a few helper projects, but I found them to be tedious to use. A better workaround is to add an attribute to all of your ambiguous classes, e.g. datatype, that contains a string literal for each class. In this case, you also have to reduce ambiguity by explicitly stating a typing.Union field that lists all possibly expected classes, as illustrated below, for marshmallow-dataclass:

```python
from dataclasses import dataclass, field
from typing import List, Union

from marshmallow.fields import Constant


@dataclass
class Tiger(Animal):
    weight_lbs: int
    datatype: str = field(metadata=dict(marshmallow_field=Constant("tiger")), default="tiger")


@dataclass
class Elephant(Animal):
    trunk_length_cm: int
    datatype: str = field(metadata=dict(marshmallow_field=Constant("elephant")), default="elephant")


@dataclass
class Zoo:
    animals: List[Union[Tiger, Elephant]]  # formerly List[Animal]
```

However, while this approach makes data (de-)serialization of ambiguous data possible, it taints the schema with aspects that are specific to (de-)serialization.
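The underlying idea of the datatype attribute is a discriminator: a value in the data that tells the deserializer which class to build. The following library-free sketch shows that mechanism with plain dataclasses. The registry `ANIMAL_TYPES` and the function `animal_from_dict` are hypothetical names, and the sketch is not how marshmallow-dataclass resolves the Union internally.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Type


@dataclass
class Animal:
    name: str


@dataclass
class Tiger(Animal):
    weight_lbs: int
    datatype: str = "tiger"


@dataclass
class Elephant(Animal):
    trunk_length_cm: int
    datatype: str = "elephant"


# Hypothetical registry: maps each discriminator value to its class.
ANIMAL_TYPES: Dict[str, Type[Animal]] = {"tiger": Tiger, "elephant": Elephant}


def animal_from_dict(data: dict) -> Animal:
    """Pick the concrete class based on the 'datatype' entry."""
    cls = ANIMAL_TYPES[data["datatype"]]  # KeyError for unknown types
    return cls(**data)


animals: List[Animal] = [Elephant(name="Eldor", trunk_length_cm=176),
                         Tiger(name="Roy", weight_lbs=405)]
raw = [asdict(a) for a in animals]             # serialize to plain dicts
restored = [animal_from_dict(d) for d in raw]  # and back again
print(restored == animals)  # True
```

The trade-off is the same as described above: the discriminator field exists purely for (de-)serialization and still ends up in your domain classes.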

Conclusion

Marshmallow and Pydantic are both very powerful frameworks. I suggest that you experiment with both of them, to get a handle on how they feel, how well they can be integrated into your existing code base, and understand their run-time performance. You need to decide early on whether you want to put class definition and schema definition into a single class (using a dataclass), or whether to keep them separated. The latter is semantically cleaner, but involves writing more code.
