Large dataset in memory

Posted by parsa 
parsa
Large dataset in memory
September 29, 2010 06:40PM
Hey fellas,

I have a large key-value map that I want to serve in a web service
application. I want to keep a single instance of this map in memory
(around a 600 MB footprint) and let every request made to the service
use the very same object. I'm new to memcached and, to be honest, to
caching in general. So is it better to keep the object in memory as a
whole, or to add the key-value pairs to the cache separately? (btw
I'm using Scala on Lift)
Adam Lee
Re: Large dataset in memory
September 29, 2010 07:20PM
This is, essentially, precisely what memcached is. You can view memcached
as one large, shared map that should appear identical to all clients as long
as they are configured the same. It isn't "one object," but rather a
distributed cache shared equally amongst all the servers running the daemon,
but from the point of view of the clients/code, it basically looks like one
large map.
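
A rough sketch of why identically configured clients see "one large map": each client hashes the key to pick a server, so the same key always lands on the same daemon. The server names below are made up, and the modulo scheme is a simplification (real clients often use consistent hashing instead).

```python
import hashlib

# Hypothetical server list; every client must be configured with the same one.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

def server_for(key, servers=SERVERS):
    """Pick a server by hashing the key (simple modulo scheme, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

# Two independent "clients" with the same configuration agree on placement.
keys = ("user:1", "user:2", "user:3")
client_one = {k: server_for(k) for k in keys}
client_two = {k: server_for(k) for k in keys}
```

If the server lists differ between clients, the same key can map to different daemons, which is why the configuration has to match.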

--
awl
Adam Lee
Re: Large dataset in memory
September 29, 2010 07:30PM
Now that I think about it, though, it sounds like you don't actually want a
cache. Memcached is truly a cache, and is not guaranteed to keep your
values around.

Perhaps you want something more like TokyoTyrant or Redis. We (fotolog.com)
recently open-sourced our Scala client for Redis. You can take a look at
Redis at http://code.google.com/p/redis/ and our Scala client at
http://github.com/andreyk0/redis-client-scala-netty

Redis is a key-value store, rather than a cache, and it tries to be more
ACID-like...

--
awl
parsa
Re: Large dataset in memory
September 30, 2010 09:40AM
Redis sounds cool. Can I put a prefix tree (trie)-like structure in
it?


ligerdave
Re: Large dataset in memory
September 30, 2010 05:10PM
migrating to Redis is great once you realize a relational db is not
needed in your system.

i think your goal is to cache some objects in memory to boost
performance.

i suspect that you already have a DB. am i right?

memcached is a "perfect" (nothing is perfect) solution if you just want
to add another layer on top of your existing web infrastructure. 600mb
isn't much, actually, and most likely you won't even use that much. here
is how it works (not memcached itself, but how you would use it):

1. a request asks for object(s)
2. the app asks memcached for that/those object(s) (two cases here)
2.1 if the object(s) aren't in memory, fire a query to the db for the
object(s) and store them in memcached. you can choose to keep the
object(s) in memory for a certain period of time or with no expiry
(memcached may still evict them when it needs room)
2.2 if the object(s) are in memory, nothing more needs to be done
3. return the object(s)
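
The steps above can be sketched roughly as follows. `TtlCache` is an in-process stand-in for a memcached client and `query_db` is a hypothetical database lookup; neither is a real library API, they only illustrate the flow.

```python
import time

class TtlCache:
    """In-process stand-in for a memcached client (get/set with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._items = {}
        self._clock = clock

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]          # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._items[key] = (value, expires_at)

db_queries = []

def query_db(key):
    """Hypothetical slow database lookup; records each call."""
    db_queries.append(key)
    return "row-for-" + key

def load(cache, key, ttl_seconds=300):
    value = cache.get(key)                # step 2: ask the cache first
    if value is None:                     # step 2.1: miss, so query the db and store
        value = query_db(key)
        cache.set(key, value, ttl_seconds)
    return value                          # step 3: return the object

cache = TtlCache()
first = load(cache, "item:42")            # goes to the db
second = load(cache, "item:42")           # served from the cache
```

Only the first call reaches the database; the DAO is the natural place for a function like `load`.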

so basically, an object gets loaded into memory only the first time
you ask for it. depending on how many of those objects your app
actually uses, the space taken is less than or equal to 600mb.

the wonderful thing about this solution is that you don't need to change
anything you already have, except the DAO a little bit (adding
the caching strategy).


if your content is "static" (the dynamic pages don't change once they
are created), i think you should look into server-side page caching,
which gives you the biggest performance boost. memcached is more of an
app-level caching solution.
Henrik Schröder
Re: Large dataset in memory
September 30, 2010 05:40PM
How do you generate this key-value map? All at once, or can you compute each
individual value given a key?

How do you use this map? All at once, or a few values for each web service
request?

How does the map change? All at once, or do you know which specific keys
need to be invalidated?

If you can re-compute single values easily, and if you only need a few of
them per request, and if you will invalidate single keys, then memcached is
a good fit for your project. Every time you need a value, you first check
the cache. If it's in the cache, great, you got it. If not, compute the
value, and put it in the cache. If a value changes, just remove it from
memcached, or compute it and put in the new value immediately.
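
A minimal sketch of the two ways to handle a changed value described above: delete the cached entry, or overwrite it with the freshly computed value. The dict `cache` stands in for memcached, and `source` and `compute` are made-up placeholders for the real data and the real computation.

```python
cache = {}             # stand-in for memcached
source = {"a": 1}      # hypothetical authoritative data
computations = []      # records every (expensive) computation

def compute(key):
    computations.append(key)
    return source[key] * 10

def get_value(key):
    if key not in cache:              # miss: compute and store
        cache[key] = compute(key)
    return cache[key]

def on_change_delete(key, new_raw):
    source[key] = new_raw
    cache.pop(key, None)              # remove it; the next reader recomputes

def on_change_replace(key, new_raw):
    source[key] = new_raw
    cache[key] = compute(key)         # put in the new value immediately

before = get_value("a")               # computed on first use
on_change_delete("a", 2)
after_delete = get_value("a")         # recomputed after the delete
on_change_replace("a", 3)
after_replace = get_value("a")        # already fresh, no extra computation
```

Deleting is simpler; replacing saves the next reader from paying for the recomputation.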

How often does the map change? If it changes extremely rarely, you could
just cache the map in application memory on each individual webserver
instead, and have some mechanism for invalidating all of them at once.

Remember that memcached is a cache, it is not a permanent data store.
Putting an item into it in no way guarantees that you will get it out, it
only guarantees that if you get something out, it will be the latest version
of the item.
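
To make "putting an item in does not guarantee you get it out" concrete, here is a toy least-recently-used cache with a fixed capacity. It is only an illustration: memcached's real eviction works per slab class and is more involved.

```python
from collections import OrderedDict

class TinyLruCache:
    """Toy fixed-size cache that evicts the least recently used item."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)            # mark as most recently used
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)     # evict the oldest entry

    def get(self, key):
        if key not in self._items:
            return None                         # never stored, or evicted
        self._items.move_to_end(key)
        return self._items[key]

cache = TinyLruCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.set("c", 3)            # no room left, so "a" is evicted
missing = cache.get("a")     # the set succeeded earlier, yet the value is gone
latest = cache.get("c")
```

Code that reads from a cache therefore always needs a fallback path for a miss.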


/Henrik

parsa
Re: Large dataset in memory
October 01, 2010 06:00AM
> i suspect that you already have a DB. am i right?

Yes that's where I'm getting the data from, it's on another server
though.

> How do you generate this key-value map? All at once, or can you compute each
> individual value given a key?

I generate it all at once from the DB and it's an expensive process.

> How do you use this map? All at once, or a few values for each web service
> request?

Each request only needs parts of the map, not all of it. But as the
number of simultaneous requests grows to somewhere near 500, there's a
chance of using 90% of the map.

> How does the map change? All at once, or do you know which specific keys
> need to be invalidated?

It doesn't change in run-time. It changes on a schedule once in a
month.

I think caching is not the way to go for me. I've looked into key-
value databases but the problem is the algorithm that's triggered with
each request (think of some searching) requires a specific type of
data which is a Trie or prefix tree. Currently, I generate the map
once in a singleton object inside the servlet container and give
references to it for each request and it works. But what I'm saying
is, maybe it's better to hold the data as a normal key-value map, then
when each request arrives, generate a Trie out of it and run the
algorithm with that Trie. (some sort of lazy loading)
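
For reference, a small sketch of the build-once approach described above: the trie is constructed a single time at startup and every request only reads from it. The `Trie` class, the sample keys and `handle_request` are illustrative assumptions, not the actual application code. Rebuilding a trie from the plain map on each request would repeat the construction cost every time.

```python
class Trie:
    """Minimal prefix tree mapping string keys to values."""

    def __init__(self):
        self.children = {}
        self.value = None
        self.has_value = False

    def insert(self, key, value):
        node = self
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Trie()
            node = child
        node.value, node.has_value = value, True

    def with_prefix(self, prefix):
        """Return a sorted list of (key, value) pairs whose key starts with prefix."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(prefix, node)]
        while stack:
            path, current = stack.pop()
            if current.has_value:
                found.append((path, current.value))
            for ch, child in current.children.items():
                stack.append((path + ch, child))
        return sorted(found)

def build_trie(mapping):
    trie = Trie()
    for key, value in mapping.items():
        trie.insert(key, value)
    return trie

# Built once at startup, then shared read-only by all requests.
SHARED_TRIE = build_trie({"car": 1, "card": 2, "care": 3, "dog": 4})

def handle_request(prefix):
    return SHARED_TRIE.with_prefix(prefix)
```

Because the data only changes once a month, the shared structure can be treated as immutable between rebuilds, which keeps concurrent reads simple.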

Thanks for your tips, fellas.

Henrik Schröder
Re: Large dataset in memory
October 01, 2010 12:20PM
On Fri, Oct 1, 2010 at 05:59, parsa wrote:

>
> > How do you generate this key-value map? All at once, or can you compute
> > each individual value given a key?
>
> I generate it all at once from the DB and it's an expensive process.
>
> > How does the map change? All at once, or do you know which specific keys
> > need to be invalidated?
>
> It doesn't change in run-time. It changes on a schedule once in a
> month.

Then you're correct, a key-value cache or datastore is not what you want.

Generate the data, save it as a generated blob on disk somewhere, and have
your application load the entire blob and cache it locally on each
webserver. It's only 600MB, so it should fit on each machine. If a machine
restarts, it can always load the generated blob again.
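
A minimal sketch of the generate-once, load-everywhere idea. The file location and the sample data are made up, and `pickle` is just one possible serialization format (it should only ever be used to load files you produced yourself).

```python
import os
import pickle
import tempfile

def write_blob(data, path):
    """Write to a temp file, then rename, so readers never see a partial blob."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)

def load_blob(path):
    """Load the whole blob into memory (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Offline job (e.g. monthly): run the expensive generation and publish the blob.
blob_path = os.path.join(tempfile.mkdtemp(), "dataset.blob")  # hypothetical location
write_blob({"car": 1, "card": 2}, blob_path)

# Each web server at startup (or after a restart): load once, keep in memory.
DATASET = load_blob(blob_path)
```

The expensive database work happens once per schedule; a restarting machine only pays for reading the file.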


/Henrik
ligerdave
Re: Large dataset in memory
October 01, 2010 04:40PM
again, you don't need to generate it all at once. whenever a request asks
for certain objects, get them from the db (if not in cache) and store them
in the cache. loading locally means you waste space on duplicates and
have to deal with synchronization issues.

i think in your case, you can use a nosql solution called mongodb
(a document db). it stores its data in memory-mapped files and it supports
some easy queries that let you do some sql-like operations.

the goal is to have all the information loaded into memory (cached) to
speed up your app. how to smartly load the objects is the key.
James Phillips - Personal
Re: Large dataset in memory
October 01, 2010 04:50PM
Membase has memcached built in, if you are looking for a key-value database.
It provides in-memory caching like memcached, but also persists data to durable
media automatically. It is 100% on-the-wire compatible with memcached. It
currently lacks the query capability of MongoDB (beyond the obvious
primary-key-based "query") ... but not for long.

Adam Lee
Re: Large dataset in memory
October 01, 2010 05:20PM
On Thursday, September 30, 2010, parsa wrote:
> Each request only needs parts of the map, not all of it. But as the
> number of simultaneous requests grows to somewhere near 500, there's a
> chance of using 90% of the map.
>
> It doesn't change in run-time. It changes on a schedule once in a
> month.
>
> I think caching is not the way to go for me. I've looked into key-
> value databases but the problem is the algorithm that's triggered with
> each request (think of some searching) requires a specific type of
> data which is a Trie or prefix tree. Currently, I generate the map
> once in a singleton object inside the servlet container and give
> references to it for each request and it works. But what I'm saying
> is, maybe it's better to hold the data as a normal key-value map, then
> when each request arrives, generate a Trie out of it and run the
> algorithm with that Trie. (some sort of lazy loading)

Generate each Trie, serialize it as a byte array and store it in the
key-value store. This eliminates any need to do computation at runtime
and any data duplication while still meeting all of your concurrency
and performance needs. Redis can do this quite easily and it should be
very fast and simple to work with. If you choose to go down this route
and need any assistance, let me know!
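
A rough sketch of that idea, with a plain dict standing in for the key-value store (no real Redis client is used here) and a nested-dict trie as an assumed representation. Note that the reader still pays for fetching and deserializing the bytes, which matters for large values.

```python
import pickle

def build_trie(words):
    """Nested-dict trie; the "$" key marks the end of a stored word.
    Assumes the words themselves never contain "$"."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return "$" in node

kv_store = {}   # stand-in for a byte-oriented key-value store such as Redis

# Offline: build each trie once, serialize it, and store the bytes under a key.
kv_store["trie:animals"] = pickle.dumps(build_trie(["cat", "cow", "dog"]))

# At read time: fetch the bytes and deserialize; nothing is rebuilt from raw data.
payload = kv_store["trie:animals"]
trie = pickle.loads(payload)
```

Splitting the data into several smaller tries, each under its own key, keeps the individual values small enough to fetch quickly.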

Good luck

--
awl