AWS re:Invent 2020: What’s new in Amazon DocumentDB (with MongoDB compatibility)

Hi, I'm Brian Hess, a Senior
Specialist Solution Architect with DocumentDB.
I'm going to talk a bit today about what's new with Amazon DocumentDB with MongoDB compatibility. We'll start with some of the major features, the headline topics from this last year, then move on to some of the other important additions that flesh out the DocumentDB feature set, and then we'll finish with a short demo of some of the major features. That's what we'll cover today. So, just to remind folks,
Amazon DocumentDB is a fully managed, scalable document database
that supports MongoDB workloads. As we look at the year at a glance and what we've done over the course of the last year, you can see here a steady stream of features that have rolled out.

All of these really start with you, the customers, telling us which features, instance types, regions, and integrations with other AWS services are most important. We use that to set the roadmap, and then we put out features on a regular basis to satisfy those most-asked-for requirements. So, if there are things you want most in DocumentDB, or integrations with other services you'd like to see, make sure you let us know. That really helps us flesh out a strong roadmap throughout the year as we endeavor to put this steady stream of features in your hands. With that, let's start with some of the big things we've launched this year. Probably the one that's gotten us the most charged up, and one of the biggest asks we've heard from our customers, has been support for MongoDB 4.0 compatibility.

Previously, DocumentDB supported MongoDB 3.6 compatibility and that feature set, but MongoDB 4.0 has a number of interesting features that customers were telling us they wanted. The first one here on the slide is Change Streams. Change Streams are a way to tap into the changes in a database, much like Change Data Capture: you can process those changes and do other things with them, feed them downstream, make a notification, etc. Previously, we had the ability to watch a collection, and we had a window of changes that were retained for up to one day. The first thing we did is extend that window from one day up to seven days.

You can configure that in your cluster. Previously, you had the ability to start watching from now, or to pick up a stream that you had previously opened and resume where you left off. But there wasn't really a way to jump into the stream at a particular time, so we added this Start At capability. In the example here on the right, towards the bottom, we're starting at noon on October 1st, assuming that's within the seven-day retention period, and we're watching the changes from that point on. We'll look at that a little more in the demo later on. The other thing we did is support not just watching collections themselves, but watching changes at the database level. Watching at the database level, we'll see all the changes to any of the collections in that database.

We also have the ability to watch at the cluster level, where you can see all the changes to all of the collections in all of the databases, so it's a really nice way to look at changes across the board.
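
As a rough sketch of what those three scopes look like from Python with PyMongo (the connection string and names here are placeholders, and this assumes change streams are enabled on the cluster):

    import pymongo

    # Placeholder connection string for a DocumentDB cluster.
    client = pymongo.MongoClient("mongodb://user:password@mycluster:27017/?ssl=true")

    collection_stream = client["shop"]["orders"].watch()  # one collection
    database_stream = client["shop"].watch()              # all collections in "shop"
    cluster_stream = client.watch()                       # all collections, all databases

    # Each stream is an iterator of change events.
    event = next(collection_stream)
    print(event["operationType"], event.get("fullDocument"))
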
In addition to that, with 4.0 compatibility we also made a number of performance improvements, improved role-based access control, and added a few new operators. But probably the biggest thing people were looking for with MongoDB 4.0 is transactions. Transactions are a really nice way in databases to group a number of writes together, the idea being that you see those writes as if they happened simultaneously.

That's a really nice way to handle an atomic update of a database. We can do that by updating multiple documents within the same collection, so consider the use case we see at the bottom. Financial transactions are a pretty common example people talk about. The idea is that if we had a collection where the documents were the balances for each of our customers, we would want to take $400 away from John and give it to Carlos, and we'd like to have those updates viewed atomically.

We'd like to not have a customer or a user be able to view
an intermediate state. For instance, if we took
the money away from John before we gave it to Carlos, we wouldn't want someone
to come in the middle and notice that there's $400
that's just missing. Similarly, if we gave
the money to Carlos before we took it away from John, we wouldn't want to see $400
appear out of nowhere. To other applications
and other clients, they should see that the $400
moved as one unit. We also have the ability to update multiple collections. A scenario there might be, say, a hotel reservation, or some other kind of resource that you're scheduling. In one collection, we'd have the resource collection; here, maybe it's rooms, tracking when each room is available and who has made a reservation for it on which day.

But we also have our users, and they're going to want
to have access to all of the rooms
for the reservations they've made. So, we'd want to reserve the room
and update the room collection, but we'd also want to go on
over to the user, in this case, Jane, and update her profile to say
that she had reserved that room. We wouldn't want to see some intermediate state; we would just want to see those two things happen simultaneously. Transactions allow us to do that. CloudWatch is an important part
of Amazon DocumentDB, and how we can see
what's going on in the cluster.

When we rolled out transactions,
we rolled out a number of metrics so you can get some visibility into
what's going on with transactions, and be able to set charts
and alarms if you need them.
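
For example, here's a minimal sketch of alarming on one of those metrics with boto3 (the cluster name and threshold are made up; the full list of transaction metric names is in the DocumentDB CloudWatch documentation):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical alarm: fire when open transactions stay high.
    cloudwatch.put_metric_alarm(
        AlarmName="docdb-transactions-open-high",
        Namespace="AWS/DocDB",
        MetricName="TransactionsOpen",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-cluster"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=100.0,
        ComparisonOperator="GreaterThanThreshold",
    )
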
If we take a quick look at how this appears to the developer, this is a JavaScript example.
If you're familiar with MongoDB, this is using the MongoDB APIs,
so it should look familiar. The first step here
is to create a session. A transaction occurs
within a session, so you start
this session up top and then you start a transaction
in that session. So, all of the updates
that are going to happen after that within that transaction
will be bound up together. We see here that we're going to
take $400 away from account one and we're going to give $400
to account two. Up until this point,
if you're outside of the transaction, if you're in another connection
to the database, you don't see either
of those operations.

Then we commit the transaction. Once the commit is complete, everybody observes those two operations as if they've happened. Then we do some cleanup
with ending the session. So, it's a pretty straightforward way to do this. Again, this is a JavaScript example; I have an example in the demo that uses Python in a very similar fashion.
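
As a minimal sketch, that same flow in Python with PyMongo looks like this (the connection string, database, and account IDs are placeholders):

    import pymongo

    client = pymongo.MongoClient("mongodb://user:password@mycluster:27017/?ssl=true")
    accounts = client["bank"]["accounts"]

    with client.start_session() as session:  # a transaction lives inside a session
        session.start_transaction()
        accounts.update_one({"_id": 1}, {"$inc": {"balance": -400}}, session=session)
        accounts.update_one({"_id": 2}, {"$inc": {"balance": 400}}, session=session)
        session.commit_transaction()  # both updates become visible at once
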
Beyond MongoDB 4.0, there are another couple of really important additions to highlight here. The first of those is the T3 instances.
lightweight instance type, and has therefore, a nice price point
to go along with it. It's less than eight cents an hour, we like to say it's cheaper
than a cup of coffee a day. Two cores, four gigs of RAM. This is really a nice development
and test environment instance type. We got a lot of requests for having
lower powered instances for the cases when you didn't need performance,
like dev test and integration and capability
and functionality testing. So, T3 instances
fits that need very nicely. There are some workloads out there
that are just pretty modest, and they don't need to have
a really powerful instance.

This could apply to those use cases just as well. Now, I did want to highlight another feature of DocumentDB that was rolled out previously, which allows you to stop a cluster. From a cost-optimization point of view, the combination of lower-cost instances like T3s and this ability to stop a cluster is very attractive for dev/test. The stop cluster feature allows you to stop all the instances for a DocumentDB cluster; you're not paying any instance costs, but all the data stays.

Since we've separated
compute and storage, we can keep the storage
for all the data that has already been persisted
in the database and simply turn off the instances. When you're ready, you come back on,
you turn on the instances, they'll connect to the storage and you're back to basically
where you left off. This is really nice for saving
a few dollars with respect to dev
and test machines or clusters, because you can turn them off
at the end of the day and the user can go home and then
when they come back in the morning, restart the cluster,
and they're back in business. So, you save a little money overnight and a little money over the weekend. We see that as a really attractive dev/test setup.
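
As a sketch, stopping and starting a cluster is a single API call each way; with boto3 it looks something like this (the cluster identifier is a placeholder):

    import boto3

    docdb = boto3.client("docdb")

    # Stop the instances at the end of the day; the storage, and your data, remain.
    docdb.stop_db_cluster(DBClusterIdentifier="my-dev-cluster")

    # Start it back up in the morning and pick up where you left off.
    docdb.start_db_cluster(DBClusterIdentifier="my-dev-cluster")
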
Security is job zero here, so of course we'd like to highlight improvements in security. One of the things we rolled out was role-based access control. This is support for built-in roles that allow for various levels of access. You can enable read or read-write access on the collections in a database, or for all collections in all databases.

You can grant permissions
to allow you to create users or have other
administrative privileges. We manage these with roles that you can grant inside of DocumentDB.
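
For example, creating users with built-in roles looks something like this from PyMongo (the user names, password, and database here are made up):

    # A user who can only read the "sales" database.
    client["admin"].command(
        "createUser", "reportingUser",
        pwd="a-strong-password",
        roles=[{"role": "read", "db": "sales"}],
    )

    # A user with read-write access to all databases, via a built-in role.
    client["admin"].command(
        "createUser", "appUser",
        pwd="another-strong-password",
        roles=[{"role": "readWriteAnyDatabase", "db": "admin"}],
    )
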
Now, the AWS Database Migration Service isn't really part of DocumentDB, but it's something we've seen
a lot paired with DocumentDB over the last year, so it's been a really important part
of the DocumentDB ecosystem. If you're not familiar with it, DMS allows you to do a migration from various database sources to various database targets. We've seen a lot of migrations
from MongoDB to DocumentDB. Those can be done in two ways. You can do a one-time move
of all of the data that is already
in the source database, or you can tap into the Change
Data Capture feature, and allow for that data to stream
across to DocumentDB in an ongoing fashion. And then of course, you can combine
those two things to do both, if you wanted to move
the data that exists and then continue moving
any changes that happen.

This year, we added a couple
of new features. Previously, we supported MongoDB 3.6 as a source and Amazon DocumentDB 3.6 as a target. Now we've added support for MongoDB 4.0 as a source and Amazon DocumentDB 3.6 as a source; this is the first time we're seeing DocumentDB on the left side of this picture. Of course, when we launched DocumentDB 4.0, we enabled support for DocumentDB 4.0 as a target, so that's been rolled into DMS as well. Also, wanting to improve the bulk move of existing data, the full load scenario, we made investments that improved that performance by up to 3X.

So, you get through the migration quite a bit faster than previously. Those were the big-ticket items. There are a lot of other features I want to highlight that really flesh out the service and the ecosystem as a whole. MongoDB compatibility is a big, important part of DocumentDB; it's right there in the product name. There were a number of investments we made over the course of the year with respect to compatibility, beyond just 4.0 compatibility, even at the 3.6 level, that again round out the experience.

The first is execution statistics for explain plans. This is really a way to understand a little more about how the query plan is going to go, and to get better insight into how your queries are going to perform. That allows you to refine your queries, add new indexes, etc., and really dig a little deeper into the internals.
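
From a driver, requesting execution statistics looks something like this (a PyMongo sketch; the collection and filter are hypothetical):

    # Ask the server how it will execute this query, with execution statistics.
    db = client["shop"]
    plan = db.command(
        "explain",
        {"find": "orders", "filter": {"person": "Charlie"}},
        verbosity="executionStats",
    )
    print(plan)
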
Null characters in strings: if you've been around data before, this is one of those things that always comes up and bites you in the ankle. It's nice to be able to handle what happens if there are null characters in strings, and that's something we added
to DocumentDB this year as well. We also improved indexes
for $regex queries, as well as multi-key indexing.
If we look beyond that, I mentioned already some
of the change streams enhancements when we talked about DocumentDB 4.0. Those changes also apply
to DocumentDB 3.6. We have the longer retention window for the changes, so you have a little more time to process them, as well as the ability to watch changes at the database level and at the cluster level.

We also launched a number of new operators: some array operators, arithmetic operators, and date operators, as well as the $out (output) operator. The $out operator is part of the aggregation pipeline; it's typically the last step. What it does is it takes
the flow of the pipeline, all the transformations that you've
done in your aggregation pipeline, and you can write that
to another DocumentDB collection.

So, you can think of it as something like an INSERT ... SELECT in the SQL world. It's a really nice way to take the output you've produced and, instead of returning it to a client, put it into a collection that can then be queried on its own. It's nice for doing summaries, etc.
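
For instance, here's a sketch of summarizing one collection into another with $out (the database and collection names are made up):

    db = client["shop"]

    # Total spend per person, written to a "spend_summary" collection
    # instead of being returned to the client.
    db["orders"].aggregate([
        {"$group": {"_id": "$person", "total": {"$sum": "$amount"}}},
        {"$out": "spend_summary"},
    ])

    # The summary can now be queried like any other collection.
    print(list(db["spend_summary"].find()))
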
I mentioned some of the security improvements around role-based access control before. Another one in the security bucket to talk about is Cluster Deletion Protection. What we really wanted to protect
against was the scenario where somebody accidentally deletes an entire cluster and, with it, terabytes' worth of data.

We got a little bit of feedback
that it would be really handy if we could make it
just a little bit more difficult to do a full cluster delete, because it's something
you really want to be careful with. With cluster deletion protection, we have a flag on the cluster
in its configuration. When it's enabled,
you can't delete the cluster, and so deleting the cluster
becomes a two-step process. The first step is to go
and disable that flag, so modify the cluster configuration
to disable the flag, and then you can delete the cluster.
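
With boto3, that two-step delete looks something like this (the cluster identifier and snapshot name are placeholders):

    import boto3

    docdb = boto3.client("docdb")

    # Step 1: turn the deletion protection flag off.
    docdb.modify_db_cluster(
        DBClusterIdentifier="my-cluster",
        DeletionProtection=False,
        ApplyImmediately=True,
    )

    # Step 2: now the delete can proceed (taking a final snapshot here).
    docdb.delete_db_cluster(
        DBClusterIdentifier="my-cluster",
        SkipFinalSnapshot=False,
        FinalDBSnapshotIdentifier="my-cluster-final-snapshot",
    )
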

So, it's a really handy way
to just protect against accidents. It is possible to override that and disable the flag on creation of the cluster, but that's moving a little fast and loose, so it typically defaults to on, for safety's sake. On the resource side, we got some
feedback about wanting to increase some of these limits
that have been in DocumentDB. The first thing we did
is we increased the number of users you can create in a cluster by 10X, so you can now create up to 1,000 users in the cluster.

Cursor limits are the number of open cursors you can have to the cluster. Previously, it was one limit for many different instance types. Now, as the instance class gets more powerful, we support more open cursors concurrently, up to a total of 4,560 for our largest instance type. Similarly, the connection limit has also been raised so that more powerful instances can handle more concurrent connections, up to 30,000 connections for our most powerful instances. If we switch gears a little bit,
we also launched DocumentDB in four new regions,
so three new commercial regions around the world, as well as AWS GovCloud (US-West). This greatly expands the footprint
of DocumentDB throughout the world. Now, one of the key things that's a
nice differentiator for DocumentDB is the integration
with other AWS services. I already mentioned that CloudWatch plays a really important role for DocumentDB, helping you understand what's going on inside the cluster and inside the service itself. As we launch new features,
we launch new metrics, but we also launch new metrics
when we just get an understanding that it would be useful
to have a deeper view and be able to inspect
more of what DocumentDB is doing.

I already mentioned increasing the limits for connections and cursors.
We also added metrics so that you can see how much
those resources are being used. Again, they're a limited resource, so it's good to be able
to monitor those. The MongoDB opcounters metrics are a way to look at which MongoDB API calls are being made to the cluster, so you can get an understanding of the workload. Are they reads? Are they writes? If they're writes, are they inserts
or are they deletes? Get a good picture of what's going on
from a workload point of view. And then a number of other metrics
that, again, sort of fill out the ability
to inspect and introspect DocumentDB. The last service I wanted to talk
a little bit about for integration is AWS Glue. Glue is an ETL service
that allows you to read and write data from various sources, do transformations of it
as it sort of moves along, and it allows for really flexible, powerful transformation tool
and data movement tool. Earlier in the year,
we launched the ability to read and write to DocumentDB
from within Glue.

This allows for some
special configuration that lets us leverage
the Glue context and some of the Glue constructs
to make reading and writing more efficient and, frankly, simpler for users. That allows us to read from DocumentDB, transform the data and do other things with it, maybe write it elsewhere, as well as take data from other places, transform it, and land it in DocumentDB.
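
In a Glue job, the read side looks roughly like this (a sketch; the URI, credentials, and names are placeholders, and the full set of connection options is in the Glue documentation):

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read a DocumentDB collection into a Glue DynamicFrame.
    dynamic_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="documentdb",
        connection_options={
            "uri": "mongodb://mycluster:27017",
            "database": "shop",
            "collection": "orders",
            "username": "user",
            "password": "password",
            "ssl": "true",
        },
    )
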
One of the things that we launched later in the year was the ability to support the Glue Crawler. Glue has a crawler, which will make
a connection to a data source and look at the data sets
that are in the data source; in DocumentDB that's the databases and the collections. It goes through and adds those data sources and data sets into the Glue catalog.

So, the Glue catalog is a way to know where data is in your enterprise throughout AWS, to have a name for it, and to understand the connections. Now you can leverage the catalog in Glue jobs for an even simpler way of authoring a Glue transformation job. For DocumentDB, the crawler will look at the
collections and actually scan through and try to infer the schema. Since DocumentDB is
a flexible schema model, that's actually
a little bit of a challenge. So, it's going to go through
and inspect the documents that are actually there,
infer that schema, and put it into the catalog, since Glue has a catalog
with a schema.

Once it's in there, like I said, it's very simple to author Glue jobs, with DocumentDB as either a source or a destination, leveraging that registration in the catalog. With that, let's move
over now and take a look at a demo on a couple of these
major features with DocumentDB. Okay, for this demo, we're going to
start by looking at transactions in Amazon DocumentDB Version 4. So, what I'm using here
is a SageMaker notebook, a Jupyter notebook that we've connected up to DocumentDB.

It's a nice way to do
some development of Python code. The first thing we do up here
is just import some packages and define some simple functions
that we'll just use. Now we're going to connect
to the database, and once we've made the connection,
we will run a simple command that shows that we're connected
to this three-node cluster. The next step is to insert some data
that we'll play with. What we're going to do is some financial transactions. We've got three people here, Andy, Brian, and Charlie, and I've given them each $1,000 to start. Now let's do some transactions.
Let's move some money around.

The first thing we'll do is we'll
transfer $600 from Brian to Andy. To do that,
we'll do it in two steps. First, we'll give money,
$600 to Andy. So, we see that Andy now
has $1,600. But it still looks like Brian
has $1,000, so if somebody were to
look in right now, they would think that
he's got more money than he does, and that maybe he could transfer,
say $500 to Charlie. But he doesn't have that money,
really, because we're in the middle
of this transfer of funds. If we finish this off, you'll see that Brian now only has $400 and Andy has $1,600, and if we tried to give $500 to Charlie, the funds wouldn't be available. Let's try to do this now atomically
with a transaction. In this case, from within
the transaction, as you're doing it, you'll see this intermediate state, but if you're outside
the transaction, you don't see any of the changes
until after.

The way that we do this is that first
we create a session object, and then we start a transaction
in that session. This session object is the thing that will indicate the context that we're working in. What we'll do is add two updates in there, the first to increment Charlie by $300, and the next to decrement Brian by $300. We pass along the session
that we're using to do this. So, from within the session,
we see now that Brian has $300 less and Charlie has $300 more. We do that by passing the session in again when we do this find. If we're outside of the transaction,
if we go and do the find, and that's what we're showing here
in this next cell, from outside the transaction
still looks like Brian has $400, still looks like Charlie has $1,000.

That's because we have not
committed the transaction, yet. So, we're still
in an intermediate state, we could choose to do
some other things, but for now, let's just
finish off the transaction. Then, when we do the find, we see that everybody now sees that Brian has $100 and Charlie has $1,300. Now, what happens if we get partway through a transaction and decide we need to stop? Well, that's something we can do
by aborting the transaction. So, what we're doing here
is we're going to transfer $1,000 from Charlie to Brian. We're going to use a new session object, session2, to cover this transaction. But then what we're going
to do is abort. We're going to say,
"Actually, I don't want to do that. Let's back the whole thing out and leave
the database the way that it was." When we execute this, we see that
within the transaction we can get the $1,000
moved from Charlie to Brian, but then we decided to stop all that,
and by aborting the transaction, we now see that we're back
at where we were before.
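
The abort path, sketched in the same style:

    session2 = client.start_session()
    session2.start_transaction()

    accounts.update_one({"name": "Charlie"}, {"$inc": {"balance": -1000}}, session=session2)
    accounts.update_one({"name": "Brian"}, {"$inc": {"balance": 1000}}, session=session2)

    session2.abort_transaction()  # roll it all back; balances are unchanged
    session2.end_session()
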

So, transactions are really powerful: they allow us to isolate all of these operations into one unit that is applied atomically, either done together or rolled back together, so other users and other views of the data stay consistent and don't see the work in progress. All right, so that's transactions. Now, I want to show
a little bit about Change Streams. For Change Streams, we've got another notebook here where I've got some things already filled in; I wanted to generate the data ahead of time. Again, at the top we've got some packages and functions we're going to use, we make another connection to the database, and we print out a simple admin command to show that we're connected. Now we're enabling Change Streams. We're going to do this by enabling them on all collections in all databases.
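
That's done with a DocumentDB admin command; enabling change streams everywhere looks roughly like this (empty strings mean all databases and all collections):

    # Enable change streams for all collections in all databases.
    client["admin"].command(
        "modifyChangeStreams", 1,
        database="", collection="",
        enable=True,
    )
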
We'll show the various ways we can leverage that in a little bit. And then I inserted some data. The idea here is we go back to Andy,
Brian, and Charlie.

This time,
they're ordering off of a menu. My menu items are hamburgers, cheeseburgers, French fries, sodas, and milkshakes, so it's a fast-food kind of environment. We're going to generate this over 40 iterations, incrementing the number of items they're buying as we go through. Each iteration, I pick a random person
and a random number of items. We'll print out the timestamp
that we were using to do the insert. So, here we see that Charlie bought
a hamburger, and then two seconds later,
Brian bought two hamburgers, and then two seconds later, Andy bought three milkshakes,
four French fries, etc.

With Change Streams, what we'd like
to do is we'd like to jump on in. Previously, we could jump into
the end of the Change Stream and only process new events
that were coming in. Or if we had been
watching a collection, we could have saved off
kind of an intermediate token that says where we were in that
Change Stream and resume from it. But there wasn't a way to say that you want it to start
at a particular time. I'm going to copy this data over
just into this notepad on the right, so that we can see the data
as we move forward.

What we're going to do is start at November 24th, at 9:39 exactly. That should be right between these two events; the next thing we should see is Charlie buying 14 French fries. Let me set this up. We'll jump in there. We're going to watch this collection; we see here that we're doing a collection watch, starting at the operation time, which is this timestamp. So, if we get the next event, sure enough, there's Charlie buying 14 French fries. We see that it happened a second after the point where we started this Change Stream watch, so we jumped in from the middle.
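
A sketch of that cell: the start time is passed as a BSON timestamp (the date matches the demo; the database and collection names are placeholders):

    import datetime
    from bson.timestamp import Timestamp

    start = Timestamp(datetime.datetime(2020, 11, 24, 9, 39, 0), 0)

    # Watch one collection, starting from a point inside the retained change window.
    stream = client["shop"]["orders"].watch(start_at_operation_time=start)
    print(next(stream))  # the first event at or after the start time
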

If we were to say,
"Get the next two," we just have this stream
that we've opened. Let's just go try to get
the next two events. We see that the next event is Andy buying 15 milkshakes,
and that's correct. Then we see Charlie buying
16 French fries, and that's correct. Now, we showed how we could jump in
on a particular collection, at a particular time. One of the other things
that we added to DocumentDB was the ability to jump in
and watch at the database level. So, if we watch here and we create
a new stream at the database level, again starting at the same time as above, this time,
we'll be watching all of the changes for all of the collections
in this database.

If we run this, we see there's not
a whole lot else going on, so the event that we see is, again, Charlie buying 14 French fries a second after we jumped in. And then lastly, one of the things
that we can do is actually watch the changes all the way
at the whole cluster level. So, this is all of the changes
across all of the collections in all of the databases.
If we do this and again start at the same time (again, it's not a very busy system), what we'll see is Charlie buying 14 French fries. This is a really powerful way
to jump into the Change Stream at a particular time
of your choosing, and be able to process events
as they stream in after that. So, we got Change Streams
and Transactions, two really interesting and powerful
tools with DocumentDB this year. All right, so now if we take a look at the year at a glance again, we see that DocumentDB has rolled out a lot of features over the course of the year: major features like MongoDB 4.0 compatibility and transactions, as well as other features that flesh out the DocumentDB service.

We got a new instance type, the T3 instance, great for dev/test and lighter workloads. Multiple regions have been rolled out over the course of the year: three new regions and GovCloud (US-West). And then there's integration with a number of AWS services. So, I really feel like there are a lot of great new features here, and I'm really excited to see what you all can build with them over the course of the next year. I did want to give a quick call-out to a couple of other DocumentDB-related talks during re:Invent. The first one here is from Zulily; they're going to talk about what they've built with DocumentDB and Amazon Kinesis Data Analytics. There's also a talk giving a deep dive on DocumentDB, and another about migrating databases to DocumentDB.

With that, I want to thank you.
I look forward, again, to seeing what you all build
in the coming year.
