<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>petermao&#8217;s tech blog &#187; redis</title>
	<atom:link href="http://www.petermao.com/category/redis/feed" rel="self" type="application/rss+xml" />
	<link>http://www.petermao.com</link>
	<description>Discussion welcome; let&#8217;s improve together</description>
	<lastBuildDate>Fri, 17 Feb 2017 07:03:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
		<item>
		<title>Redis source code analysis 25 &#8211; VM (part 3)</title>
		<link>http://www.petermao.com/redis/121.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=121</link>
		<comments>http://www.petermao.com/redis/121.html#comments</comments>
		<pubDate>Sun, 01 May 2011 11:01:22 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=73</guid>
		<description><![CDATA[This section introduces the multi-threaded machinery in redis. First, the threaded swap-out path: serverCron calls vmSwapOneObjectThreaded to swap out a value in threaded mode; vmSwapOneObjectThreaded calls vmSwapOneObject (explained in the previous section), and vmSwapOneObject ultimately calls vmSwapObjectThreaded. static int vmSwapObjectThreaded(robj *key, robj *val, redisDb *db) { iojob *j; assert(key-&#62;storage == REDIS_VM_MEMORY); assert(key-&#62;refcount == 1); j = zmalloc(sizeof(*j)); j-&#62;type = REDIS_IOJOB_PREPARE_SWAP; j-&#62;db = db; j-&#62;key = key; j-&#62;val = val; incrRefCount(val); j-&#62;canceled = &#8230; <a href="http://www.petermao.com/redis/121.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This section introduces the multi-threaded machinery in redis.</p>
<p>First, let's look at how swap-out works in threaded mode.</p>
<p>serverCron calls vmSwapOneObjectThreaded to swap out a value in threaded mode; vmSwapOneObjectThreaded calls vmSwapOneObject (see the previous section), which ultimately calls vmSwapObjectThreaded.<span id="more-73"></span></p>
<pre class="wp-code-highlight prettyprint">
static int vmSwapObjectThreaded(robj *key, robj *val, redisDb *db) {
    iojob *j;

    assert(key-&gt;storage == REDIS_VM_MEMORY);
    assert(key-&gt;refcount == 1);

    j = zmalloc(sizeof(*j));
    j-&gt;type = REDIS_IOJOB_PREPARE_SWAP;
    j-&gt;db = db;
    j-&gt;key = key;
    j-&gt;val = val;
    incrRefCount(val);
    j-&gt;canceled = 0;
    j-&gt;thread = (pthread_t) -1;
    key-&gt;storage = REDIS_VM_SWAPPING;

    lockThreadedIO();
    queueIOJob(j);
    unlockThreadedIO();
    return REDIS_OK;
}
</pre>
<p>vmSwapObjectThreaded creates a job of type REDIS_IOJOB_PREPARE_SWAP and queues it with queueIOJob. queueIOJob's main work is to append the new job to server.io_newjobs and, if the number of spawned threads has not yet reached the configured limit, to spawn a new thread.</p>
<pre class="wp-code-highlight prettyprint">
/* This function must be called with threaded I/O locked */
static void queueIOJob(iojob *j) {
    redisLog(REDIS_DEBUG,&quot;Queued IO Job %p type %d about key '%s'\n&quot;,
        (void*)j, j-&gt;type, (char*)j-&gt;key-&gt;ptr);
    listAddNodeTail(server.io_newjobs,j);
    if (server.io_active_threads &lt; server.vm_max_threads)
        spawnIOThread();
}
</pre>
<p>As spawnIOThread shows, new threads start at IOThreadEntryPoint.</p>
<pre class="wp-code-highlight prettyprint">
static void spawnIOThread(void) {
    pthread_t thread;
    sigset_t mask, omask;
    int err;

    sigemptyset(&amp;mask);
    sigaddset(&amp;mask,SIGCHLD);
    sigaddset(&amp;mask,SIGHUP);
    sigaddset(&amp;mask,SIGPIPE);
    pthread_sigmask(SIG_SETMASK, &amp;mask, &amp;omask);
    while ((err = pthread_create(&amp;thread,&amp;server.io_threads_attr,IOThreadEntryPoint,NULL)) != 0) {
        redisLog(REDIS_WARNING,&quot;Unable to spawn an I/O thread: %s&quot;,
            strerror(err));
        usleep(1000000);
    }
    pthread_sigmask(SIG_SETMASK, &amp;omask, NULL);
    server.io_active_threads++;
}
</pre>
<p>IOThreadEntryPoint moves a job from io_newjobs into server.io_processing; after doing the work the job's type calls for (loading a value, computing the pages a value needs, or swapping a value out), it moves the job from server.io_processing into io_processed. It then writes one byte to the pipe behind server.io_ready_pipe_write (io_ready_pipe_read and io_ready_pipe_write are the two ends of the pipe), waking the sleeping vmThreadedIOCompletedJob, which handles the follow-up work.</p>
<pre class="wp-code-highlight prettyprint">
static void *IOThreadEntryPoint(void *arg) {
    iojob *j;
    listNode *ln;
    REDIS_NOTUSED(arg);

    pthread_detach(pthread_self());
    while(1) {
        /* Get a new job to process */
        lockThreadedIO();
        if (listLength(server.io_newjobs) == 0) {
            /* No new jobs in queue, exit. */
            redisLog(REDIS_DEBUG,&quot;Thread %ld exiting, nothing to do&quot;,
                (long) pthread_self());
            server.io_active_threads--;
            unlockThreadedIO();
            return NULL;
        }
        ln = listFirst(server.io_newjobs);
        j = ln-&gt;value;
        listDelNode(server.io_newjobs,ln);
        /* Add the job in the processing queue */
        j-&gt;thread = pthread_self();
        listAddNodeTail(server.io_processing,j);
        ln = listLast(server.io_processing); /* We use ln later to remove it */
        unlockThreadedIO();
        redisLog(REDIS_DEBUG,&quot;Thread %ld got a new job (type %d): %p about key '%s'&quot;,
            (long) pthread_self(), j-&gt;type, (void*)j, (char*)j-&gt;key-&gt;ptr);

        /* Process the Job */
        if (j-&gt;type == REDIS_IOJOB_LOAD) {
            j-&gt;val = vmReadObjectFromSwap(j-&gt;page,j-&gt;key-&gt;vtype);
        } else if (j-&gt;type == REDIS_IOJOB_PREPARE_SWAP) {
            FILE *fp = fopen(&quot;/dev/null&quot;,&quot;w+&quot;);
            j-&gt;pages = rdbSavedObjectPages(j-&gt;val,fp);
            fclose(fp);
        } else if (j-&gt;type == REDIS_IOJOB_DO_SWAP) {
            if (vmWriteObjectOnSwap(j-&gt;val,j-&gt;page) == REDIS_ERR)
                j-&gt;canceled = 1;
        }

        /* Done: insert the job into the processed queue */
        redisLog(REDIS_DEBUG,&quot;Thread %ld completed the job: %p (key %s)&quot;,
            (long) pthread_self(), (void*)j, (char*)j-&gt;key-&gt;ptr);
        lockThreadedIO();
        listDelNode(server.io_processing,ln);
        listAddNodeTail(server.io_processed,j);
        unlockThreadedIO();

        /* Signal the main thread there is new stuff to process */
        assert(write(server.io_ready_pipe_write,&quot;x&quot;,1) == 1);
    }
    return NULL; /* never reached */
}

static void vmThreadedIOCompletedJob(aeEventLoop *el, int fd, void *privdata,
            int mask)
{
    char buf[1];
    int retval, processed = 0, toprocess = -1, trytoswap = 1;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);
    REDIS_NOTUSED(privdata);

    if (privdata != NULL) trytoswap = 0; /* check the comments above... */

    /* For every byte we read in the read side of the pipe, there is one
     * I/O job completed to process. */
    while((retval = read(fd,buf,1)) == 1) {
        iojob *j;
        listNode *ln;
        robj *key;
        struct dictEntry *de;

        redisLog(REDIS_DEBUG,&quot;Processing I/O completed job&quot;);

        /* Get the processed element (the oldest one) */
        lockThreadedIO();
        assert(listLength(server.io_processed) != 0);
        if (toprocess == -1) {
            toprocess = (listLength(server.io_processed)*REDIS_MAX_COMPLETED_JOBS_PROCESSED)/100;
            if (toprocess &lt;= 0) toprocess = 1;
        }
        ln = listFirst(server.io_processed);
        j = ln-&gt;value;
        listDelNode(server.io_processed,ln);
        unlockThreadedIO();
        /* If this job is marked as canceled, just ignore it */
        if (j-&gt;canceled) {
            freeIOJob(j);
            continue;
        }
        /* Post process it in the main thread, as there are things we
         * can only do here to avoid race conditions and/or invasive locks */
        redisLog(REDIS_DEBUG,&quot;Job %p type: %d, key at %p (%s) refcount: %d\n&quot;, (void*) j, j-&gt;type, (void*)j-&gt;key, (char*)j-&gt;key-&gt;ptr, j-&gt;key-&gt;refcount);
        de = dictFind(j-&gt;db-&gt;dict,j-&gt;key);
        assert(de != NULL);
        key = dictGetEntryKey(de);
        if (j-&gt;type == REDIS_IOJOB_LOAD) {
            redisDb *db;

            /* Key loaded, bring it at home */
            key-&gt;storage = REDIS_VM_MEMORY;
            key-&gt;vm.atime = server.unixtime;
            vmMarkPagesFree(key-&gt;vm.page,key-&gt;vm.usedpages);
            redisLog(REDIS_DEBUG, &quot;VM: object %s loaded from disk (threaded)&quot;,
                (unsigned char*) key-&gt;ptr);
            server.vm_stats_swapped_objects--;
            server.vm_stats_swapins++;
            dictGetEntryVal(de) = j-&gt;val;
            incrRefCount(j-&gt;val);
            db = j-&gt;db;
            freeIOJob(j);
            /* Handle clients waiting for this key to be loaded. */
            handleClientsBlockedOnSwappedKey(db,key);
        } else if (j-&gt;type == REDIS_IOJOB_PREPARE_SWAP) {
            /* Now we know the amount of pages required to swap this object.
             * Let's find some space for it, and queue this task again
             * rebranded as REDIS_IOJOB_DO_SWAP. */
            if (!vmCanSwapOut() ||
                vmFindContiguousPages(&amp;j-&gt;page,j-&gt;pages) == REDIS_ERR)
            {
                /* Ooops... no space or we can't swap as there is
                 * a fork()ed Redis trying to save stuff on disk. */
                freeIOJob(j);
                key-&gt;storage = REDIS_VM_MEMORY; /* undo operation */
            } else {
                /* Note that we need to mark these pages as used now;
                 * if the job is canceled, we'll mark them as free
                 * again. */
                vmMarkPagesUsed(j-&gt;page,j-&gt;pages);
                j-&gt;type = REDIS_IOJOB_DO_SWAP;
                lockThreadedIO();
                queueIOJob(j);
                unlockThreadedIO();
            }
        } else if (j-&gt;type == REDIS_IOJOB_DO_SWAP) {
            robj *val;

            /* Key swapped. We can finally free some memory. */
            if (key-&gt;storage != REDIS_VM_SWAPPING) {
                printf(&quot;key-&gt;storage: %d\n&quot;,key-&gt;storage);
                printf(&quot;key-&gt;name: %s\n&quot;,(char*)key-&gt;ptr);
                printf(&quot;key-&gt;refcount: %d\n&quot;,key-&gt;refcount);
                printf(&quot;val: %p\n&quot;,(void*)j-&gt;val);
                printf(&quot;val-&gt;type: %d\n&quot;,j-&gt;val-&gt;type);
                printf(&quot;val-&gt;ptr: %s\n&quot;,(char*)j-&gt;val-&gt;ptr);
            }
            redisAssert(key-&gt;storage == REDIS_VM_SWAPPING);
            val = dictGetEntryVal(de);
            key-&gt;vm.page = j-&gt;page;
            key-&gt;vm.usedpages = j-&gt;pages;
            key-&gt;storage = REDIS_VM_SWAPPED;
            key-&gt;vtype = j-&gt;val-&gt;type;
            decrRefCount(val); /* Deallocate the object from memory. */
            dictGetEntryVal(de) = NULL;
            redisLog(REDIS_DEBUG,
                &quot;VM: object %s swapped out at %lld (%lld pages) (threaded)&quot;,
                (unsigned char*) key-&gt;ptr,
                (unsigned long long) j-&gt;page, (unsigned long long) j-&gt;pages);
            server.vm_stats_swapped_objects++;
            server.vm_stats_swapouts++;
            freeIOJob(j);
            /* Put a few more swap requests in queue if we are still
             * out of memory */
            if (trytoswap &amp;&amp; vmCanSwapOut() &amp;&amp;
                zmalloc_used_memory() &gt; server.vm_max_memory)
            {
                int more = 1;
                while(more) {
                    lockThreadedIO();
                    more = listLength(server.io_newjobs) &lt;
                            (unsigned) server.vm_max_threads;
                    unlockThreadedIO();
                    /* Don't waste CPU time if swappable objects are rare. */
                    if (vmSwapOneObjectThreaded() == REDIS_ERR) {
                        trytoswap = 0;
                        break;
                    }
                }
            }
        }
        processed++;
        if (processed == toprocess) return;
    }
    if (retval &lt; 0 &amp;&amp; errno != EAGAIN) {
        redisLog(REDIS_WARNING,
            &quot;WARNING: read(2) error in vmThreadedIOCompletedJob() %s&quot;,
            strerror(errno));
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/121.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis source code analysis 24 &#8211; VM (part 2)</title>
		<link>http://www.petermao.com/redis/118.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=118</link>
		<comments>http://www.petermao.com/redis/118.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:57:48 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=71</guid>
		<description><![CDATA[VM can swap values in and out in two modes: blocking and threaded (server.vm_max_threads == 0 selects blocking mode). This section focuses on blocking mode. When redis rebuilds its db on startup (from AOF or from a snapshot), memory limits may force some values out to disk; at that point values are swapped out only in blocking mode (vmSwapOneObjectBlocking). Apart from that, redis swaps values out only in serverCron (analyzed in the event-handling chapter). Let's look at the relevant serverCron code: blocking mode swaps values out via vmSwapOneObjectBlocking, threaded mode via vmSwapOneObjectThreaded. static int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) { --- /* Swap a few keys on disk if we are over the memory limit and VM * is enabled. Try to &#8230; <a href="http://www.petermao.com/redis/118.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>VM can swap values in and out in two modes: blocking and threaded (server.vm_max_threads == 0 selects blocking mode).</p>
<p>This section focuses on blocking mode.</p>
<p>When redis rebuilds its db on startup (from AOF or from a snapshot), memory limits may force some values out to disk; at that point values are swapped out only in blocking mode (the vmSwapOneObjectBlocking function). Apart from that, redis swaps values out only in serverCron (analyzed in the event-handling chapter). Let's look at the relevant serverCron code: blocking mode swaps a value out via vmSwapOneObjectBlocking, threaded mode via vmSwapOneObjectThreaded.<span id="more-71"></span></p>
<pre class="wp-code-highlight prettyprint">
static int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    ---
    /* Swap a few keys on disk if we are over the memory limit and VM
     * is enabled. Try to free objects from the free list first. */
    if (vmCanSwapOut()) {
        while (server.vm_enabled &amp;&amp; zmalloc_used_memory() &gt;
                server.vm_max_memory)
        {
            ---
            if (tryFreeOneObjectFromFreelist() == REDIS_OK) continue;
            retval = (server.vm_max_threads == 0) ?
                        vmSwapOneObjectBlocking() :
                        vmSwapOneObjectThreaded();
            ---
        }
    }
    ---
    return 100;
}
</pre>
<p>Whether a value is swapped out in blocking mode (vmSwapOneObjectBlocking) or threaded mode (vmSwapOneObjectThreaded), both ultimately call vmSwapOneObject (with different arguments) to do the work.</p>
<p>vmSwapOneObject samples 5 random entries from each db and computes their swappability; it then swaps the best candidate's value out via vmSwapObjectThreaded in threaded mode, or via vmSwapObjectBlocking otherwise.</p>
<pre class="wp-code-highlight prettyprint">
static int vmSwapOneObject(int usethreads) {
    int j, i;
    struct dictEntry *best = NULL;
    double best_swappability = 0;
    redisDb *best_db = NULL;
    robj *key, *val;

    for (j = 0; j &lt; server.dbnum; j++) {
        redisDb *db = server.db+j;
        /* Why is maxtries set to 100?
         * Because this way we'll (usually) find an object even if only
         * 1%-2% of the objects are swappable. */
        int maxtries = 100;

        if (dictSize(db-&gt;dict) == 0) continue;
        for (i = 0; i &lt; 5; i++) {
            dictEntry *de;
            double swappability;

            if (maxtries) maxtries--;
            de = dictGetRandomKey(db-&gt;dict);
            key = dictGetEntryKey(de);
            val = dictGetEntryVal(de);
            /* Only swap objects that are currently in memory.
             *
             * Also don't swap shared objects if threaded VM is on, as we
             * try to ensure that the main thread does not touch the
             * object while the I/O thread is using it, but we can't
             * control other keys without adding additional mutex. */
            if (key-&gt;storage != REDIS_VM_MEMORY ||
                (server.vm_max_threads != 0 &amp;&amp; val-&gt;refcount != 1)) {
                if (maxtries) i--; /* don't count this try */
                continue;
            }
            val-&gt;vm.atime = key-&gt;vm.atime; /* atime is updated on key object */
            swappability = computeObjectSwappability(val);
            if (!best || swappability &gt; best_swappability) {
                best = de;
                best_swappability = swappability;
                best_db = db;
            }
        }
    }
    if (best == NULL) return REDIS_ERR;
    key = dictGetEntryKey(best);
    val = dictGetEntryVal(best);

    redisLog(REDIS_DEBUG,&quot;Key with best swappability: %s, %f&quot;,
        key-&gt;ptr, best_swappability);

    /* Unshare the key if needed */
    if (key-&gt;refcount &gt; 1) {
        robj *newkey = dupStringObject(key);
        decrRefCount(key);
        key = dictGetEntryKey(best) = newkey;
    }
    /* Swap it */
    if (usethreads) {
        vmSwapObjectThreaded(key,val,best_db);
        return REDIS_OK;
    } else {
        if (vmSwapObjectBlocking(key,val) == REDIS_OK) {
            dictGetEntryVal(best) = NULL;
            return REDIS_OK;
        } else {
            return REDIS_ERR;
        }
    }
}
</pre>
<p>vmSwapObjectBlocking computes the number of swap pages needed, writes the value to the vm file with a blocking write (vmWriteObjectOnSwap), and finally marks the corresponding vm pages as used.</p>
<pre class="wp-code-highlight prettyprint">
static int vmSwapObjectBlocking(robj *key, robj *val) {
    off_t pages = rdbSavedObjectPages(val,NULL);
    off_t page;

    assert(key-&gt;storage == REDIS_VM_MEMORY);
    assert(key-&gt;refcount == 1);
    if (vmFindContiguousPages(&amp;page,pages) == REDIS_ERR) return REDIS_ERR;
    if (vmWriteObjectOnSwap(val,page) == REDIS_ERR) return REDIS_ERR;
    key-&gt;vm.page = page;
    key-&gt;vm.usedpages = pages;
    key-&gt;storage = REDIS_VM_SWAPPED;
    key-&gt;vtype = val-&gt;type;
    decrRefCount(val); /* Deallocate the object from memory. */
    vmMarkPagesUsed(page,pages);
    redisLog(REDIS_DEBUG,&quot;VM: object %s swapped out at %lld (%lld pages)&quot;,
        (unsigned char*) key-&gt;ptr,
        (unsigned long long) page, (unsigned long long) pages);
    server.vm_stats_swapped_objects++;
    server.vm_stats_swapouts++;
    return REDIS_OK;
}
</pre>
<p>As for loading values: in threaded mode, blockClientOnSwappedKeys preloads them, while in blocking mode a value is loaded only when the relevant command executes. Either way, lookupKey is eventually called to check whether the key's value is in memory; if not, vmLoadObject loads the value with a blocking read.</p>
<pre class="wp-code-highlight prettyprint">
static robj *lookupKey(redisDb *db, robj *key) {
    dictEntry *de = dictFind(db-&gt;dict,key);
    if (de) {
        robj *key = dictGetEntryKey(de);
        robj *val = dictGetEntryVal(de);

        if (server.vm_enabled) {
            if (key-&gt;storage == REDIS_VM_MEMORY ||
                key-&gt;storage == REDIS_VM_SWAPPING)
            {
                /* If we were swapping the object out, stop it, this key
                 * was requested. */
                if (key-&gt;storage == REDIS_VM_SWAPPING)
                    vmCancelThreadedIOJob(key);
                /* Update the access time of the key for the aging algorithm. */
                key-&gt;vm.atime = server.unixtime;
            } else {
                int notify = (key-&gt;storage == REDIS_VM_LOADING);

                /* Our value was swapped on disk. Bring it at home. */
                redisAssert(val == NULL);
                val = vmLoadObject(key);
                dictGetEntryVal(de) = val;

                /* Clients blocked by the VM subsystem may be waiting for
                 * this key... */
                if (notify) handleClientsBlockedOnSwappedKey(db,key);
            }
        }
        return val;
    } else {
        return NULL;
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/118.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis source code analysis 23 &#8211; VM (part 1)</title>
		<link>http://www.petermao.com/redis/116.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=116</link>
		<comments>http://www.petermao.com/redis/116.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:56:36 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=69</guid>
		<description><![CDATA[VM is a feature added in Redis 2.0. Before VM, redis kept all db data in memory, so memory use kept growing as redis ran, even though clients clearly access some data far more often than the rest. VM tries to address this: in short, it lets redis push rarely accessed values out to disk while keeping every key in memory, so that looking up a swapped-out value performs about the same as before VM was enabled. VM is one of the most complex modules in redis, so we cover it in three sections: this one introduces the main data structures, the next covers blocking mode, and the last covers threaded mode. Let's first look at redis's generic object structure, redisObject: // Where the object lives when VM is enabled #define REDIS_VM_MEMORY 0 /* The object is on memory */ #define REDIS_VM_SWAPPED 1 /* The object is on disk */ #define REDIS_VM_SWAPPING 2 /* Redis is swapping this object on disk */ #define &#8230; <a href="http://www.petermao.com/redis/116.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>VM is a feature added in Redis 2.0. Before VM, redis kept all db data in memory, so memory use kept growing as redis ran, even though clients clearly access some data far more often than the rest. VM tries to address this problem: in short, it lets redis save rarely accessed values to disk. All keys stay in memory, however, so that looking up a swapped-out value performs about the same as it did before VM was enabled.</p>
<p>VM is one of the most complex modules in redis, so we cover it in three sections: this one introduces the main data structures, the next covers blocking mode, and the last covers threaded mode.</p>
<p>Let's first look at redis's generic object structure, redisObject:<span id="more-69"></span></p>
<pre class="wp-code-highlight prettyprint">
// Where the object lives when VM is enabled
#define REDIS_VM_MEMORY 0       /* The object is on memory */
#define REDIS_VM_SWAPPED 1      /* The object is on disk */
#define REDIS_VM_SWAPPING 2     /* Redis is swapping this object on disk */
#define REDIS_VM_LOADING 3      /* Redis is loading this object from disk */

/* The VM object structure */
struct redisObjectVM {
    off_t page;         /* the page at which the object is stored on disk */
    off_t usedpages;    /* number of pages used on disk */
    time_t atime;       /* Last access time */
} vm;

/* The actual Redis Object */
// Generic object type.
// A key object needs extra flags recording the value's location, type, etc.
typedef struct redisObject {
    void *ptr;
    unsigned char type;
    unsigned char encoding;
    unsigned char storage;  /* If this object is a key, where is the value?
                             * REDIS_VM_MEMORY, REDIS_VM_SWAPPED, ... */
    unsigned char vtype; /* If this object is a key, and value is swapped out,
                          * this is the type of the swapped out object. */
    int refcount;
    /* VM fields: these are only allocated if VM is active, otherwise the
     * object allocation function will just allocate
     * sizeof(redisObject) minus sizeof(redisObjectVM), so using
     * Redis without VM active will not have any overhead. */
    struct redisObjectVM vm;
} robj;
</pre>
<p>In robj, type records the object's type: string, list, set, and so on. For a key object, storage records where its value lives: in memory, on disk, being swapped out to disk, or being loaded; vtype records the type of the key's value. In the vm struct, page and usedpages record where on disk the value is stored and how many pages it occupies, and atime is the value's last access time. So when a key's storage is REDIS_VM_SWAPPED, its value is no longer in memory and must be loaded from position page in the VM file; the value's type is vtype and its size is usedpages pages.</p>
<p>When an object is created, the size allocated for the robj depends on whether VM is enabled.</p>
<pre class="wp-code-highlight prettyprint">
static robj *createObject(int type, void *ptr) {
   ---
   else {
        if (server.vm_enabled) {
            pthread_mutex_unlock(&amp;server.obj_freelist_mutex);
            o = zmalloc(sizeof(*o));
        } else {
            o = zmalloc(sizeof(*o)-sizeof(struct redisObjectVM));
        }
    }
    ---
    if (server.vm_enabled) {
        /* Note that this code may run in the context of an I/O thread
         * and accessing server.unixtime is in theory an error
         * (no locks). But in practice this is safe, and even if we read
         * garbage Redis will not fail, as it's just statistical info */
        o-&gt;vm.atime = server.unixtime;
        o-&gt;storage = REDIS_VM_MEMORY;
    }
    return o;
}
</pre>
<p>All VM-related state lives in the following fields of redisServer.</p>
<pre class="wp-code-highlight prettyprint">
 /* Global server state structure */
struct redisServer {
    ---
    /* Virtual memory state */
    FILE *vm_fp;
    int vm_fd;
    off_t vm_next_page; /* Next probably empty page */
    off_t vm_near_pages; /* Number of pages allocated sequentially */
    unsigned char *vm_bitmap; /* Bitmap of free/used pages */
    time_t unixtime;    /* Unix time sampled every second. */

    /* Virtual memory I/O threads stuff */
    /* An I/O thread processes an element taken from the io_jobs queue and
     * puts the result of the operation in the io_done list. While the
     * job is being processed, it's put on the io_processing queue. */
    list *io_newjobs; /* List of VM I/O jobs yet to be processed */
    list *io_processing; /* List of VM I/O jobs being processed */
    list *io_processed; /* List of VM I/O jobs already processed */
    list *io_ready_clients; /* Clients ready to be unblocked. All keys loaded */
    pthread_mutex_t io_mutex; /* lock to access io_jobs/io_done/io_thread_job */
    pthread_mutex_t obj_freelist_mutex; /* safe redis objects creation/free */
    pthread_mutex_t io_swapfile_mutex; /* So we can lseek + write */
    pthread_attr_t io_threads_attr; /* attributes for threads creation */
    int io_active_threads; /* Number of running I/O threads */
    int vm_max_threads; /* Max number of I/O threads running at the same time */
    /* Our main thread is blocked on the event loop, looking for sockets ready
     * to be read or written, so when a threaded I/O operation is ready to be
     * processed by the main thread, the I/O thread will use a unix pipe to
     * awake the main thread. The following are the two pipe FDs. */
    int io_ready_pipe_read;
    int io_ready_pipe_write;
    /* Virtual memory stats */
    unsigned long long vm_stats_used_pages;
    unsigned long long vm_stats_swapped_objects;
    unsigned long long vm_stats_swapouts;
    unsigned long long vm_stats_swapins;
   ---
};
</pre>
<p>vm_fp and vm_fd refer to the vm file on disk; all reads and writes to the vm file go through them. vm_bitmap tracks the allocation state of every page in the vm file (0 means the page is free, 1 means it is in use). The page size is set with vm-page-size and the page count with vm-pages. Note that a page in redis holds at most one object, while one object may span several consecutive pages. unixtime is just a cached time value, used when computing how recently a value was accessed. The remaining fields support threaded swap-in/swap-out of values. In threaded mode, each swap-in or swap-out is modeled as a job, of one of the following types:</p>
<pre class="wp-code-highlight prettyprint">
/* VM threaded I/O request message */
#define REDIS_IOJOB_LOAD 0          /* Load from disk to memory */
#define REDIS_IOJOB_PREPARE_SWAP 1  /* Compute needed pages */
#define REDIS_IOJOB_DO_SWAP 2       /* Swap from memory to disk */

typedef struct iojob {
    int type;   /* Request type, REDIS_IOJOB_* */
    redisDb *db;/* Redis database */
    robj *key;  /* This I/O request is about swapping this key */
    robj *val;  /* the value to swap for REDIS_IOREQ_*_SWAP, otherwise this
                 * field is populated by the I/O thread for REDIS_IOREQ_LOAD. */
    off_t page; /* Swap page where to read/write the object */
    off_t pages; /* Swap pages needed to save object. PREPARE_SWAP return val */
    int canceled; /* True if this command was canceled by blocking side of VM */
    pthread_t thread; /* ID of the thread processing this entry */
} iojob;
</pre>
<p>A job of type REDIS_IOJOB_LOAD loads a value, and one of type REDIS_IOJOB_DO_SWAP swaps a value out; before a value can be swapped out, a REDIS_IOJOB_PREPARE_SWAP job must be created to compute how many swap pages it needs.</p>
<p>Whichever of the three types it is, a new job is queued onto io_newjobs via queueIOJob. The thread entry function IOThreadEntryPoint moves jobs from io_newjobs into server.io_processing; after doing the work the job's type calls for (loading a value, computing the pages a value needs, or swapping a value out), it moves the job from server.io_processing into io_processed. It then writes one byte to the pipe behind server.io_ready_pipe_write (io_ready_pipe_read and io_ready_pipe_write are the two ends of the pipe), waking the sleeping vmThreadedIOCompletedJob, which handles the follow-up work.</p>
<p>io_ready_clients holds the list of clients that may resume running (they had blocked waiting for a value); the remaining fields concern thread-safety locks and global vm statistics.</p>
<p>VM is initialized in vmInit, which mostly sets up the structures described above. Beyond that, its most important job is to register vmThreadedIOCompletedJob as the handler for read events on the pipe; that function runs whenever the pipe becomes readable and is central to how threaded mode operates.</p>
<pre class="wp-code-highlight prettyprint">
static void vmInit(void) {
    off_t totsize;
    int pipefds[2];
    size_t stacksize;
    struct flock fl;

    if (server.vm_max_threads != 0)
        zmalloc_enable_thread_safeness(); /* we need thread safe zmalloc() */

    redisLog(REDIS_NOTICE,&quot;Using '%s' as swap file&quot;,server.vm_swap_file);
    /* Try to open the old swap file, otherwise create it */
    if ((server.vm_fp = fopen(server.vm_swap_file,&quot;r+b&quot;)) == NULL) {
        server.vm_fp = fopen(server.vm_swap_file,&quot;w+b&quot;);
    }
    if (server.vm_fp == NULL) {
        redisLog(REDIS_WARNING,
            &quot;Can't open the swap file: %s. Exiting.&quot;,
            strerror(errno));
        exit(1);
    }
    server.vm_fd = fileno(server.vm_fp);
    /* Lock the swap file for writing; this is useful to prevent another
     * instance from using the same swap file because of a config error. */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = fl.l_len = 0;
    if (fcntl(server.vm_fd,F_SETLK,&amp;fl) == -1) {
        redisLog(REDIS_WARNING,
            &quot;Can't lock the swap file at '%s': %s. Make sure it is not used by another Redis instance.&quot;, server.vm_swap_file, strerror(errno));
        exit(1);
    }
    /* Initialize */
    server.vm_next_page = 0;
    server.vm_near_pages = 0;
    server.vm_stats_used_pages = 0;
    server.vm_stats_swapped_objects = 0;
    server.vm_stats_swapouts = 0;
    server.vm_stats_swapins = 0;
    totsize = server.vm_pages*server.vm_page_size;
    redisLog(REDIS_NOTICE,&quot;Allocating %lld bytes of swap file&quot;,totsize);
    if (ftruncate(server.vm_fd,totsize) == -1) {
        redisLog(REDIS_WARNING,&quot;Can't ftruncate swap file: %s. Exiting.&quot;,
            strerror(errno));
        exit(1);
    } else {
        redisLog(REDIS_NOTICE,&quot;Swap file allocated with success&quot;);
    }
    server.vm_bitmap = zmalloc((server.vm_pages+7)/8);
    redisLog(REDIS_VERBOSE,&quot;Allocated %lld bytes page table for %lld pages&quot;,
        (long long) (server.vm_pages+7)/8, server.vm_pages);
    memset(server.vm_bitmap,0,(server.vm_pages+7)/8);

    /* Initialize threaded I/O (used by Virtual Memory) */
    server.io_newjobs = listCreate();
    server.io_processing = listCreate();
    server.io_processed = listCreate();
    server.io_ready_clients = listCreate();
    pthread_mutex_init(&amp;server.io_mutex,NULL);
    pthread_mutex_init(&amp;server.obj_freelist_mutex,NULL);
    pthread_mutex_init(&amp;server.io_swapfile_mutex,NULL);
    server.io_active_threads = 0;
    if (pipe(pipefds) == -1) {
        redisLog(REDIS_WARNING,&quot;Unable to initialize VM: pipe(2): %s. Exiting.&quot;
            ,strerror(errno));
        exit(1);
    }
    server.io_ready_pipe_read = pipefds[0];
    server.io_ready_pipe_write = pipefds[1];
    redisAssert(anetNonBlock(NULL,server.io_ready_pipe_read) != ANET_ERR);
    /* LZF requires a lot of stack */
    pthread_attr_init(&amp;server.io_threads_attr);
    pthread_attr_getstacksize(&amp;server.io_threads_attr, &amp;stacksize);

    /* Solaris may report a stacksize of 0, let's set it to 1 otherwise
     * multiplying it by 2 in the while loop later will not really help ;) */
    if (!stacksize) stacksize = 1;

    while (stacksize &lt; REDIS_THREAD_STACK_SIZE) stacksize *= 2;
    pthread_attr_setstacksize(&amp;server.io_threads_attr, stacksize);
    /* Listen for events in the threaded I/O pipe */
    if (aeCreateFileEvent(server.el, server.io_ready_pipe_read, AE_READABLE,
        vmThreadedIOCompletedJob, NULL) == AE_ERR)
        oom(&quot;creating file event&quot;);
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/116.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 22 – The Protocol</title>
		<link>http://www.petermao.com/redis/114.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=114</link>
		<comments>http://www.petermao.com/redis/114.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:55:03 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=67</guid>
		<description><![CDATA[Redis listens on TCP port 6379 by default. Its protocol is line-oriented text rather than binary, with every line terminated by "\r\n", which makes it very easy to understand. As ProtocolSpecification.html explains, commands sent to Redis can receive the following kinds of replies (for values that do not exist, -1 is returned, and the client library should return a suitable nil object, such as NULL in C, rather than an empty string): 1) The first byte is the character "-", followed by one line of error text (error reply). For example, when lpop operates on an object that is not a list, it returns: "-ERR Operation against a key holding the wrong kind of value\r\n" 2) The first byte is the character "+", followed by a status line describing the result (line reply); a successful set returns "+OK\r\n" 3) The first byte is the character "$", followed by a line with a single number giving the byte count of the data on the next line, or -1 if the value does not exist (bulk reply); a successful get returns something like "$7\r\nmyvalue\r\n", a missing key "$-1\r\n" 4) The first byte is the character "*", followed by a line with a single number giving the count of bulk replies, or -1 if the value does not exist (multi-bulk reply); for lrange over the range 0&#8211;2, a successful reply looks like "*3\r\n$6\r\nvalue1\r\n$7\r\nmyvalue\r\n$5\r\nhello\r\n", a missing key yields "*-1\r\n". 5) The first byte is the character ":", followed by an integer value (integer reply); incr, for example, returns the value of the object after adding 1. Client commands come in the following formats; the first string must always be the command name, and arguments are separated by a single space: 1) Inline command: a single line. For EXISTS, the client sends bytes such as "EXISTS mykey\r\n". 2) Bulk command: like the bulk reply, usually two lines, the first of which is "command arguments &#8230; <a href="http://www.petermao.com/redis/114.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Redis listens on TCP port 6379 by default. Its protocol is line-oriented text rather than binary, with every line terminated by "\r\n", which makes it very easy to understand.</p>
<p>As ProtocolSpecification.html explains, commands sent to Redis can receive the following kinds of replies (for values that do not exist, -1 is returned, and the client library should return a suitable nil object, such as NULL in C, rather than an empty string):</p>
<p>1) The first byte is the character "-", followed by one line of error text (error reply)<span id="more-67"></span></p>
<p>For example, when the lpop command operates on an object that is not a list, it returns an error such as:</p>
<p>"-ERR Operation against a key holding the wrong kind of value\r\n"</p>
<p>2) The first byte is the character "+", followed by a single status line describing the result of the command (line reply)</p>
<p>For example, a successful set command returns "+OK\r\n".</p>
<p>3) The first byte is the character "$", followed first by a line containing a single number, which gives the byte count of the data on the next line, or -1 if the value does not exist (bulk reply)</p>
<p>For example, a successful get returns something like "$7\r\nmyvalue\r\n", while a missing key returns "$-1\r\n".</p>
<p>4) The first byte is the character "*", followed first by a line containing a single number, which gives the count of bulk replies that follow, or -1 if the value does not exist (multi-bulk reply)</p>
<p>For example, if lrange is asked for the elements in the range 0&#8211;2, a successful reply looks like "*3\r\n$6\r\nvalue1\r\n$7\r\nmyvalue\r\n$5\r\nhello\r\n", while a missing key yields "*-1\r\n".</p>
<p>5) The first byte is the character ":", followed by an integer value (integer reply)</p>
<p>For example, a successful incr returns the value of the object after adding 1.</p>
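<p>The five reply formats above can be sketched with a few small formatting helpers. These are hypothetical illustrations, not functions from the Redis source; real bulk replies are binary safe, while these helpers assume NUL-terminated C strings and a sufficiently large buffer for brevity.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers (not part of redis) that format reply kinds
 * described above into buf; buf is assumed large enough. */
int fmt_status(char *buf, size_t n, const char *msg) {      /* line reply   */
    return snprintf(buf, n, "+%s\r\n", msg);
}
int fmt_error(char *buf, size_t n, const char *msg) {       /* error reply  */
    return snprintf(buf, n, "-%s\r\n", msg);
}
int fmt_integer(char *buf, size_t n, long long v) {         /* integer reply */
    return snprintf(buf, n, ":%lld\r\n", v);
}
int fmt_bulk(char *buf, size_t n, const char *data) {       /* bulk reply   */
    if (data == NULL) return snprintf(buf, n, "$-1\r\n");   /* missing key  */
    return snprintf(buf, n, "$%zu\r\n%s\r\n", strlen(data), data);
}
```

A multi-bulk reply is simply a "*&lt;count&gt;\r\n" header followed by that many bulk replies formatted as in fmt_bulk.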
<p>A client, in turn, can issue commands in the following formats. The first string must always be the command name, and arguments are separated by a single space:<br />
1) Inline command: a single line </p>
<p>For example, for the EXISTS command the client sends a byte stream such as "EXISTS mykey\r\n".<br />
2) Bulk command: similar to the bulk reply format, usually two lines; the first line is "command arguments count", where the trailing number gives the byte count of the data on the next line</p>
<p>For example, for the SET command the client sends something like "SET mykey 5\r\nhello\r\n".</p>
<p>3) Multi-bulk command: analogous to the multi-bulk reply format.</p>
<p>For example, the SET command above expressed in the multi-bulk protocol becomes "*3\r\n$3\r\nSET\r\n$5\r\nmykey\r\n$5\r\nhello\r\n".</p>
<p>Although for some commands this format sends more bytes than the bulk command form, it can express every command, it supports commands with multiple binary-safe arguments (the bulk form supports only one), and it lets a client library support newly released Redis commands without code changes (commands it does not recognize can simply be sent in multi-bulk form). The official Redis documentation also mentions that in the future clients may be required to issue commands only in the multi-bulk format.</p>
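<p>As a sketch of the multi-bulk command format, the hypothetical encoder below (not Redis code) serializes an argument vector the way a client library might. For brevity it assumes NUL-terminated C strings and a sufficiently large buffer, although the real protocol is binary safe.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encoder: serialize argv into the multi-bulk command
 * format, e.g. SET mykey hello becomes
 * "*3\r\n$3\r\nSET\r\n$5\r\nmykey\r\n$5\r\nhello\r\n". */
size_t encode_multibulk(char *buf, size_t n, int argc, const char **argv) {
    size_t off = (size_t)snprintf(buf, n, "*%d\r\n", argc);
    for (int i = 0; i < argc; i++)
        off += (size_t)snprintf(buf + off, n - off, "$%zu\r\n%s\r\n",
                                strlen(argv[i]), argv[i]);
    return off; /* bytes written, excluding the NUL terminator */
}
```

Encoding each argument with its own length prefix is what makes every argument binary safe: the receiver never scans for delimiters inside the payload.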
<p>As a side note, a client library may send several commands in a row without waiting for Redis to return the result of the previous command before issuing the next one. This mechanism is called pipelining, and most Redis client libraries support it; readers can explore it on their own.</p>
<p>Finally, let us look at the functions Redis uses to send replies.</p>
<p>Redis uses addReplySds, addReplyDouble, addReplyLongLong, addReplyUlong, addReplyBulkLen, addReplyBulk, addReplyBulkCString and friends to build the different reply formats, and all of them ultimately call addReply to send the data.</p>
<p>addReply appends the outgoing data to the tail of the client's reply list and relies on sendReplyToClient to send it. sendReplyToClient walks the reply list and writes each entry in turn; if reply gluing is enabled (server.glueoutputbuf is true), glueReplyBuffersIfNeeded first merges the entries of the reply list into a single buffer so that they can be sent in one write.</p>
<pre class="wp-code-highlight prettyprint">
static void addReply(redisClient *c, robj *obj) {
    if (listLength(c-&gt;reply) == 0 &amp;&amp;
        (c-&gt;replstate == REDIS_REPL_NONE ||
         c-&gt;replstate == REDIS_REPL_ONLINE) &amp;&amp;
        aeCreateFileEvent(server.el, c-&gt;fd, AE_WRITABLE,
        sendReplyToClient, c) == AE_ERR) return;

    if (server.vm_enabled &amp;&amp; obj-&gt;storage != REDIS_VM_MEMORY) {
        obj = dupStringObject(obj);
        obj-&gt;refcount = 0; /* getDecodedObject() will increment the refcount */
    }
    listAddNodeTail(c-&gt;reply,getDecodedObject(obj));
}

static void sendReplyToClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *c = privdata;
    int nwritten = 0, totwritten = 0, objlen;
    robj *o;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);

    /* Use writev() if we have enough buffers to send */
    if (!server.glueoutputbuf &amp;&amp;
        listLength(c-&gt;reply) &gt; REDIS_WRITEV_THRESHOLD &amp;&amp;
        !(c-&gt;flags &amp; REDIS_MASTER))
    {
        sendReplyToClientWritev(el, fd, privdata, mask);
        return;
    }

    while(listLength(c-&gt;reply)) {
        if (server.glueoutputbuf &amp;&amp; listLength(c-&gt;reply) &gt; 1)
            glueReplyBuffersIfNeeded(c);

        o = listNodeValue(listFirst(c-&gt;reply));
        objlen = sdslen(o-&gt;ptr);

        if (objlen == 0) {
            listDelNode(c-&gt;reply,listFirst(c-&gt;reply));
            continue;
        }

        if (c-&gt;flags &amp; REDIS_MASTER) {
            /* Don't reply to a master */
            nwritten = objlen - c-&gt;sentlen;
        } else {
            nwritten = write(fd, ((char*)o-&gt;ptr)+c-&gt;sentlen, objlen - c-&gt;sentlen);
            if (nwritten &lt;= 0) break;
        }
        c-&gt;sentlen += nwritten;
        totwritten += nwritten;
        /* If we fully sent the object on head go to the next one */
        if (c-&gt;sentlen == objlen) {
            listDelNode(c-&gt;reply,listFirst(c-&gt;reply));
            c-&gt;sentlen = 0;
        }
        /* Note that we avoid to send more than REDIS_MAX_WRITE_PER_EVENT
         * bytes, in a single threaded server it's a good idea to serve
         * other clients as well, even if a very large request comes from
         * super fast link that is always able to accept data (in real world
         * scenario think about 'KEYS *' against the loopback interface) */
        if (totwritten &gt; REDIS_MAX_WRITE_PER_EVENT) break;
    }
    if (nwritten == -1) {
        if (errno == EAGAIN) {
            nwritten = 0;
        } else {
            redisLog(REDIS_VERBOSE,
                &quot;Error writing to client: %s&quot;, strerror(errno));
            freeClient(c);
            return;
        }
    }
    if (totwritten &gt; 0) c-&gt;lastinteraction = time(NULL);
    if (listLength(c-&gt;reply) == 0) {
        c-&gt;sentlen = 0;
        aeDeleteFileEvent(server.el,c-&gt;fd,AE_WRITABLE);
    }
}
</pre>
<p>As for implementing a client library, you can write one yourself following the formats described above, or read an existing client library to deepen your understanding.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/114.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 21 – Transactions</title>
		<link>http://www.petermao.com/redis/112.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=112</link>
		<comments>http://www.petermao.com/redis/112.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:53:56 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=64</guid>
		<description><![CDATA[Redis transactions are fairly simple and do not provide the full ACID properties. One main reason is that commands inside a Redis transaction are not executed immediately; they are queued until the exec command is issued, at which point they all run. Another is that there is no rollback: commands in a transaction may partly succeed and partly fail, and a failed command returns much the same error it would outside a transaction. Whether better support will arrive in the future remains to be seen. Let us look at how Redis transactions are implemented today. The main transaction-related structures are shown below. Each redisClient's multiState stores the commands to be executed in the transaction context. /* Client MULTI/EXEC state */ typedef struct multiCmd { robj **argv; int argc; struct redisCommand *cmd; } multiCmd; typedef struct multiState { multiCmd *commands; /* Array of MULTI commands */ int count; /* Total number of MULTI &#8230; <a href="http://www.petermao.com/redis/112.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Redis transactions are fairly simple and do not provide the full ACID properties. One main reason is that commands inside a Redis transaction are not executed immediately: they are queued until the exec command is issued, at which point they all run. Another is that there is no rollback: commands in a transaction may partly succeed and partly fail, and a failed command returns much the same error it would outside a transaction context. Whether better support will arrive in the future remains to be seen.</p>
<p>Let us look at how Redis transactions are implemented today.</p>
<p>The main transaction-related structures are shown below. Each redisClient's multiState stores the commands to be executed in the transaction context.<span id="more-64"></span></p>
<pre class="wp-code-highlight prettyprint">
/* Client MULTI/EXEC state */
typedef struct multiCmd {
    robj **argv;
    int argc;
    struct redisCommand *cmd;
} multiCmd;

typedef struct multiState {
    multiCmd *commands;     /* Array of MULTI commands */
    int count;              /* Total number of MULTI commands */
} multiState;

typedef struct redisClient {
   ---
    multiState mstate;      /* MULTI/EXEC state */
    ---
} redisClient;
</pre>
<p>A client enters the transaction context by issuing the multi command. A client in the transaction context has the REDIS_MULTI flag set, and the multi command itself returns immediately.</p>
<pre class="wp-code-highlight prettyprint">
static void multiCommand(redisClient *c) {
    c-&gt;flags |= REDIS_MULTI;
    addReply(c,shared.ok);
}
</pre>
<p>A client in the transaction context queues into mstate every command issued before exec; the command is not executed, and shared.queued is returned immediately (if the preceding argument checks fail, an error is returned instead and nothing is queued into mstate). This is visible in processCommand (see the earlier chapter on command processing for a detailed walkthrough of that function). queueMultiCommand simply grows the mstate array and appends the current command to it.</p>
<pre class="wp-code-highlight prettyprint">
static int processCommand(redisClient *c) {
    ---
   /* Exec the command */
    if (c-&gt;flags &amp; REDIS_MULTI &amp;&amp; cmd-&gt;proc != execCommand &amp;&amp; cmd-&gt;proc != discardCommand) {
        queueMultiCommand(c,cmd);
        addReply(c,shared.queued);
    } else {
        if (server.vm_enabled &amp;&amp; server.vm_max_threads &gt; 0 &amp;&amp;
            blockClientOnSwappedKeys(c,cmd)) return 1;
        call(c,cmd);
    }
    ---
}
</pre>
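<p>The growth of mstate that queueMultiCommand performs can be sketched as follows. This is a simplified stand-in with void pointers in place of robj and redisCommand, and without the reference counting and out-of-memory handling of the real code.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for redis's multiCmd/multiState. */
typedef struct { void **argv; int argc; void *cmd; } multi_cmd;
typedef struct { multi_cmd *commands; int count; } multi_state;

/* Grow the command array by one slot and append the current command,
 * in the spirit of queueMultiCommand. */
void queue_multi_command(multi_state *ms, void *cmd, int argc, void **argv) {
    ms->commands = realloc(ms->commands, sizeof(multi_cmd) * (ms->count + 1));
    multi_cmd *mc = ms->commands + ms->count;
    mc->cmd  = cmd;
    mc->argc = argc;
    mc->argv = argv;   /* the real code also takes a reference on each robj */
    ms->count++;
}
```

Because the array is reallocated on every queued command, queuing is O(count) in the worst case, which is acceptable for the typically short command lists of a transaction.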
<p>When the client issues the exec command, Redis calls execCommand to run the set of commands queued in the transaction context. Note that before doing so, Redis uses execBlockClientOnSwappedKeys to load in advance the keys that the command set needs (that function ultimately calls waitForMultipleSwappedKeys, introduced earlier), because the command table cmdTable is set up like this:</p>
<pre class="wp-code-highlight prettyprint">
{&quot;exec&quot;,execCommand,1,REDIS_CMD_INLINE|REDIS_CMD_DENYOOM,execBlockClientOnSwappedKeys,0,0,0},
</pre>
<p>execCommand first checks that the client is in a transaction context, then uses execCommandReplicateMulti to send/write the multi command word to any slave/monitor/aof (assuming those features are in use): the multi command itself was never queued, and since execCommand writes the exec command after it finishes, the exec must be paired with a multi. After that it calls call to execute each command in turn. Since the return value of call is not checked here, a failing command can only report its error through its own reply; whether each command succeeded is not examined at this level. Finally, the commands in mstate are cleared and the REDIS_MULTI flag is removed.</p>
<pre class="wp-code-highlight prettyprint">
static void execCommand(redisClient *c) {
    int j;
    robj **orig_argv;
    int orig_argc;

    if (!(c-&gt;flags &amp; REDIS_MULTI)) {
        addReplySds(c,sdsnew(&quot;-ERR EXEC without MULTI\r\n&quot;));
        return;
    }

    /* Replicate a MULTI request now that we are sure the block is executed.
     * This way we'll deliver the MULTI/..../EXEC block as a whole and
     * both the AOF and the replication link will have the same consistency
     * and atomicity guarantees. */
    execCommandReplicateMulti(c);

    /* Exec all the queued commands */
    orig_argv = c-&gt;argv;
    orig_argc = c-&gt;argc;
    addReplySds(c,sdscatprintf(sdsempty(),&quot;*%d\r\n&quot;,c-&gt;mstate.count));
    for (j = 0; j &lt; c-&gt;mstate.count; j++) {
        c-&gt;argc = c-&gt;mstate.commands[j].argc;
        c-&gt;argv = c-&gt;mstate.commands[j].argv;
        call(c,c-&gt;mstate.commands[j].cmd);
    }
    c-&gt;argv = orig_argv;
    c-&gt;argc = orig_argc;
    freeClientMultiState(c);
    initClientMultiState(c);
    c-&gt;flags &amp;= (~REDIS_MULTI);
    /* Make sure the EXEC command is always replicated / AOF, since we
     * always send the MULTI command (we can't know beforehand if the
     * next operations will contain at least a modification to the DB). */
    server.dirty++;
}
</pre>
<p>One last note: if Redis goes down while the transaction context is being executed, so that the final exec command is never written, the slave/monitor/aof is left in an inconsistent state. Redis detects this situation after a restart, in loadAppendOnlyFile (provided that AOF persistence was in use both before the crash and after the restart). When Redis detects it, it exits; the user can then repair the file with the redis-check-aof tool.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/112.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 20 – Publish/Subscribe</title>
		<link>http://www.petermao.com/redis/110.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=110</link>
		<comments>http://www.petermao.com/redis/110.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:52:44 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=60</guid>
		<description><![CDATA[Redis's publish/subscribe feature resembles traditional message routing: publishers publish messages, subscribers receive them, and the bridge between them is the subscribed channel or pattern. Publishers publish to a given channel, and subscribers block on the channels or patterns they subscribe to. Note that a publisher cannot address a particular subscriber, and a subscriber cannot receive messages from only a particular publisher: the relationship is loosely coupled, with subscribers not knowing who published a message and publishers not knowing who will receive it. The feature is exposed through five commands: SUBSCRIBE, UNSUBSCRIBE, PSUBSCRIBE, PUNSUBSCRIBE and PUBLISH. SUBSCRIBE and UNSUBSCRIBE subscribe to or unsubscribe from channels, PSUBSCRIBE and PUNSUBSCRIBE do the same for patterns, and messages are published with the publish command. To see how this is implemented, let us first look at the relevant structures. struct redisServer { --- /* Pubsub */ dict *pubsub_channels;/* Map channels to list of subscribed clients */ list *pubsub_patterns;/* A list of pubsub_patterns */ --- } typedef struct redisClient { --- dict *pubsub_channels; /* channels a &#8230; <a href="http://www.petermao.com/redis/110.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Redis's publish/subscribe feature resembles traditional message routing: publishers publish messages, subscribers receive them, and the bridge between publishers and subscribers is the subscribed channel or pattern. A publisher publishes messages to a given channel, and subscribers block on the channels or patterns they subscribe to. Note that a publisher cannot address a particular subscriber, and a subscriber cannot receive messages from only a particular publisher. The relationship between publishers and subscribers is loosely coupled: subscribers do not know who published a message, and publishers do not know who will receive it.</p>
<p>The publish/subscribe feature is exposed through five commands: SUBSCRIBE, UNSUBSCRIBE, PSUBSCRIBE, PUNSUBSCRIBE and PUBLISH. SUBSCRIBE and UNSUBSCRIBE subscribe to or unsubscribe from a channel, PSUBSCRIBE and PUNSUBSCRIBE do the same for a pattern, and messages are published with the publish command.</p>
<p>To see how publish/subscribe is implemented, let us first look at a few related structures.<span id="more-60"></span></p>
<pre class="wp-code-highlight prettyprint">
struct redisServer {
    ---
   /* Pubsub */
   dict *pubsub_channels;/* Map channels to list of subscribed clients */
   list *pubsub_patterns;/* A list of pubsub_patterns */
   ---
}

typedef struct redisClient {
   ---
   dict *pubsub_channels; /* channels a client is interested in (SUBSCRIBE) */
   list *pubsub_patterns; /* patterns a client is interested in (SUBSCRIBE) */
} redisClient;
</pre>
<p>In the global server variable (of type redisServer), the relationship between channels and subscribers is kept in the dict pubsub_channels: a channel together with the linked list of all its subscribers forms one entry of the dict, i.e. each entry can be written as (channel, subscriber list). The relationship between patterns and subscribers is kept in the list pubsub_patterns, each node of which can be seen as a (pattern, redisClient) pair.</p>
<p>In the redisClient structure of a given subscriber, pubsub_channels holds a dict of the channels it subscribes to, while the patterns it subscribes to are kept in the list pubsub_patterns.</p>
<p>With this in mind, let us work out the worst-case time complexity of the pub/sub commands (note that inserting, deleting or looking up one dict entry is O(1), that searching or deleting in a linked list is O(N), and that appending one node at the tail of a list is O(1)).</p>
<p>SUBSCRIBE:</p>
<p>A subscriber uses SUBSCRIBE to subscribe to a particular channel. This adds one entry to pubsub_channels in the subscriber's redisClient (O(1)), then finds the channel in the server's pubsub_channels (O(1)) and appends one node at the tail of that channel's subscriber list (O(1); if the channel is not found in pubsub_channels, inserting it is likewise O(1)). The worst-case complexity of subscribing to a particular channel with SUBSCRIBE is therefore O(1).</p>
<p>UNSUBSCRIBE:</p>
<p>When a subscriber unsubscribes, one entry is first removed from pubsub_channels in its redisClient (O(1)), then the channel is found in the server's pubsub_channels (O(1)) and the subscriber is removed from the channel's subscriber list, which requires searching that list (O(N)). The total complexity is therefore O(N), where N is the number of subscribers of that channel.</p>
<p>PSUBSCRIBE:</p>
<p>When a subscriber uses PSUBSCRIBE to subscribe to a pattern, its redisClient's pubsub_patterns list is first searched to see whether the pattern already exists (O(N)); if it does not, one node is appended at the tail of the client's pubsub_patterns and one at the tail of the server's pubsub_patterns (O(1) each). The total complexity is therefore O(N), where N is the number of patterns the subscriber already subscribes to.</p>
<p>PUNSUBSCRIBE:</p>
<p>When a subscriber uses PUNSUBSCRIBE to cancel a pattern subscription, the pattern is first deleted from the client's pubsub_patterns list (O(N)), and the (subscriber, pattern) mapping is deleted from the server's pubsub_patterns list (O(M)). The total complexity is therefore O(N+M), where N is the number of patterns the subscriber subscribes to and M is the number of (subscriber, pattern) mappings in the whole system.</p>
<p>PUBLISH:</p>
<p>A message is only ever published to one specific channel, but that channel may also match some pattern. Redis therefore first finds the channel's subscriber list in the server's pubsub_channels (O(1)) and sends the message to every subscriber on it (O(N)), then examines every node of the server's pubsub_patterns list to see whether the channel matches that node's pattern (O(M), not counting the cost of the pattern matching itself). The total complexity is therefore O(N+M), where N is the number of subscribers of the channel and M is the number of (subscriber, pattern) mappings in the whole system. Note also that, as this shows, a subscriber may receive the same message more than once.</p>
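<p>The pattern test that PUBLISH performs for every node of pubsub_patterns can be illustrated with a minimal glob matcher. This is a hypothetical, stripped-down stand-in for Redis's stringmatchlen that supports only '*' and literal characters; the real function also handles '?', character classes and escaping.</p>

```c
#include <assert.h>

/* Minimal glob matcher: '*' matches any (possibly empty) substring,
 * every other character matches itself. Returns 1 on match, 0 otherwise. */
int glob_match(const char *pat, const char *str) {
    while (*pat) {
        if (*pat == '*') {
            while (pat[1] == '*') pat++;          /* collapse runs of '*'   */
            if (pat[1] == '\0') return 1;         /* trailing '*' matches all */
            for (const char *s = str; *s; s++)    /* let '*' consume 0..n chars */
                if (glob_match(pat + 1, s)) return 1;
            return glob_match(pat + 1, str);      /* empty str: '*' matches nothing */
        }
        if (*pat != *str) return 0;               /* literal mismatch       */
        pat++; str++;
    }
    return *str == '\0';                          /* both exhausted?        */
}
```

PUBLISH runs such a test once per pattern node, which is why the pattern-matching pass is linear in the number of (subscriber, pattern) mappings.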
<p>With the algorithm explained, the publish/subscribe code is easy to follow. Only the handler of the PUBLISH command, publishCommand, is shown here; see the Redis source for the code of the other commands.</p>
<pre class="wp-code-highlight prettyprint">
static void publishCommand(redisClient *c) {
    int receivers = pubsubPublishMessage(c-&gt;argv[1],c-&gt;argv[2]);
    addReplyLongLong(c,receivers);
}

/* Publish a message */
static int pubsubPublishMessage(robj *channel, robj *message) {
    int receivers = 0;
    struct dictEntry *de;
    listNode *ln;
    listIter li;

    /* Send to clients listening for that channel */
    de = dictFind(server.pubsub_channels,channel);
    if (de) {
        list *list = dictGetEntryVal(de);
        listNode *ln;
        listIter li;

        listRewind(list,&amp;li);
        while ((ln = listNext(&amp;li)) != NULL) {
            redisClient *c = ln-&gt;value;

            addReply(c,shared.mbulk3);
            addReply(c,shared.messagebulk);
            addReplyBulk(c,channel);
            addReplyBulk(c,message);
            receivers++;
        }
    }
    /* Send to clients listening to matching channels */
    if (listLength(server.pubsub_patterns)) {
        listRewind(server.pubsub_patterns,&amp;li);
        channel = getDecodedObject(channel);
        while ((ln = listNext(&amp;li)) != NULL) {
            pubsubPattern *pat = ln-&gt;value;

            if (stringmatchlen((char*)pat-&gt;pattern-&gt;ptr,
                                sdslen(pat-&gt;pattern-&gt;ptr),
                                (char*)channel-&gt;ptr,
                                sdslen(channel-&gt;ptr),0)) {
                addReply(pat-&gt;client,shared.mbulk4);
                addReply(pat-&gt;client,shared.pmessagebulk);
                addReplyBulk(pat-&gt;client,pat-&gt;pattern);
                addReplyBulk(pat-&gt;client,channel);
                addReplyBulk(pat-&gt;client,message);
                receivers++;
            }
        }
        decrRefCount(channel);
    }
    return receivers;
}
</pre>
<p>A final reminder: a client in publish/subscribe mode cannot issue any command other than the five above (except quit). This is enforced in processCommand; see the explanation of that function in the earlier chapter on command processing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/110.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 19 – Master-Slave Replication</title>
		<link>http://www.petermao.com/redis/108.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=108</link>
		<comments>http://www.petermao.com/redis/108.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:52:24 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=61</guid>
		<description><![CDATA[Let us start with the characteristics of Redis master-slave replication. The official document ReplicationHowto lists the following: 1. A master can have multiple slaves. 2. A slave can accept connections from other slaves, acting as their master, which allows a multi-level master-slave structure. 3. Replication is non-blocking on the master side, which keeps serving other clients while replicating; the slave, however, blocks during its first synchronization. 4. Replication can be used for scalability: slaves can provide data redundancy, expensive commands (such as sort) can be sent to certain slaves to avoid blocking the master, and a slave can handle persistence, which only requires commenting out the save directives in the master's configuration file. A client can connect to the master as a slave from the start, or issue the sync command at runtime to establish the master-slave relationship. Next we outline the replication machinery from the slave's and the master's points of view. When Redis runs as a slave, the global server.replstate can be REDIS_REPL_NONE (not replicating), REDIS_REPL_CONNECT (needs to connect to the master) or REDIS_REPL_CONNECTED (connected to the master). After the slaveof configuration directive is read, or the slaveof command is issued, server.replstate is set to REDIS_REPL_CONNECT, and once syncWithMaster has performed the first synchronization it becomes REDIS_REPL_CONNECTED &#8230; <a href="http://www.petermao.com/redis/108.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Let us start with the characteristics of Redis master-slave replication.</p>
<p>The official document ReplicationHowto lists the following:<br />
1. A master can have multiple slaves.<br />
2. A slave can accept connections from other slaves, acting as their master, which allows a multi-level master-slave structure.<br />
3. Replication is non-blocking on the master side: the master keeps serving other clients while replicating to slaves; the slave, however, blocks during its first synchronization.<br />
4. Replication can be used for scalability: slaves can serve as data redundancy, expensive commands (such as sort) can be sent to certain slaves to avoid blocking the master, and a slave can be dedicated to persistence, which only requires commenting out the save directives in the master's configuration file.</p>
<p>A client can connect to the master as a slave from the very beginning, or issue the sync command at runtime to establish the master-slave relationship.</p>
<p>Next we outline how Redis replication works, from the slave's and the master's points of view in turn.<span id="more-61"></span></p>
<p>When Redis runs as a slave, the global server.replstate takes one of three states: REDIS_REPL_NONE (not replicating), REDIS_REPL_CONNECT (needs to connect to the master) and REDIS_REPL_CONNECTED (connected to the master). After the slaveof configuration directive is read, or the slaveof command is issued, server.replstate is set to REDIS_REPL_CONNECT; once syncWithMaster has performed the first synchronization with the master, it becomes REDIS_REPL_CONNECTED.</p>
<p>When Redis runs as a master, the slave.replstate variable attached to a given client connection takes one of four states: REDIS_REPL_WAIT_BGSAVE_START (waiting for a bgsave to start), REDIS_REPL_WAIT_BGSAVE_END (bgsave has dumped the db; ready for the bulk transfer), REDIS_REPL_SEND_BULK (bulk transfer in progress) and REDIS_REPL_ONLINE (initial bulk transfer finished; only updates need to be sent from now on). A slave client (one that issued sync) always starts in REDIS_REPL_WAIT_BGSAVE_START (the syncCommand function is detailed below); after the db has been dumped in the background (backgroundSaveDoneHandler) it is in REDIS_REPL_WAIT_BGSAVE_END; updateSlavesWaitingBgsave then sets the state to REDIS_REPL_SEND_BULK and installs sendBulkToSlave as the write-event handler; once sendBulkToSlave has run, the state becomes REDIS_REPL_ONLINE, after which the master keeps calling replicationFeedSlaves to send new commands to every slave in the REDIS_REPL_ONLINE state.</p>
<p>Let us first look at the code Redis executes on the master side.</p>
<p>A slave always synchronizes with the master by issuing the sync command, whose handler syncCommand is shown below.</p>
<p>The comments in the function are clear enough. If the slave's client already has the REDIS_SLAVE flag set, the master has already processed it with syncCommand. If the master still has unsent replies for this client, an error is returned. After that, if server.bgsavechildpid != -1 and some slave is in the REDIS_REPL_WAIT_BGSAVE_END state, a background db dump is in progress that another slave is already waiting on, and the new slave can directly reuse the resulting rdb for its bulk transfer (note the copying of the reply list: since the master is non-blocking, commands may have executed in the meantime, and the call function invokes replicationFeedSlaves to append their arguments to that slave's reply list). If no slave is in REDIS_REPL_WAIT_BGSAVE_END but server.bgsavechildpid != -1, the bgsave child has not finished and the new slave must wait for it to end (once the bgsave child exits, the waiting slaves are processed). If server.bgsavechildpid equals -1, a background db dump must be started. Finally, the current client is appended to the master's slaves list.</p>
<pre class="wp-code-highlight prettyprint">
static void syncCommand(redisClient *c) {
    /* ignore SYNC if already slave or in monitor mode */
    if (c-&gt;flags &amp; REDIS_SLAVE) return;

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    if (listLength(c-&gt;reply) != 0) {
        addReplySds(c,sdsnew(&quot;-ERR SYNC is invalid with pending input\r\n&quot;));
        return;
    }

    redisLog(REDIS_NOTICE,&quot;Slave ask for synchronization&quot;);
    /* Here we need to check if there is a background saving operation
     * in progress, or if it is required to start one */
    if (server.bgsavechildpid != -1) {
        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save */
        redisClient *slave;
        listNode *ln;
        listIter li;

        listRewind(server.slaves,&amp;li);
        while((ln = listNext(&amp;li))) {
            slave = ln-&gt;value;
            if (slave-&gt;replstate == REDIS_REPL_WAIT_BGSAVE_END) break;
        }
        if (ln) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            listRelease(c-&gt;reply);
            c-&gt;reply = listDup(slave-&gt;reply);
            c-&gt;replstate = REDIS_REPL_WAIT_BGSAVE_END;
            redisLog(REDIS_NOTICE,&quot;Waiting for end of BGSAVE for SYNC&quot;);
        } else {
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences */
            c-&gt;replstate = REDIS_REPL_WAIT_BGSAVE_START;
            redisLog(REDIS_NOTICE,&quot;Waiting for next BGSAVE for SYNC&quot;);
        }
    } else {
        /* Ok we don't have a BGSAVE in progress, let's start one */
        redisLog(REDIS_NOTICE,&quot;Starting BGSAVE for SYNC&quot;);
        if (rdbSaveBackground(server.dbfilename) != REDIS_OK) {
            redisLog(REDIS_NOTICE,&quot;Replication failed, can't BGSAVE&quot;);
            addReplySds(c,sdsnew(&quot;-ERR Unable to perform background save\r\n&quot;));
            return;
        }
        c-&gt;replstate = REDIS_REPL_WAIT_BGSAVE_END;
    }
    c-&gt;repldbfd = -1;
    c-&gt;flags |= REDIS_SLAVE;
    c-&gt;slaveseldb = 0;
    listAddNodeTail(server.slaves,c);
    return;
}
</pre>
<p>From here on, whether the slave is in REDIS_REPL_WAIT_BGSAVE_START or REDIS_REPL_WAIT_BGSAVE_END, it is only processed once the background process dumping the db has finished. When that process exits, backgroundSaveDoneHandler runs, and it calls updateSlavesWaitingBgsave to deal with the slaves.</p>
<p>updateSlavesWaitingBgsave, like syncCommand, involves several slave state transitions. Every slave waiting for a db dump has been placed on the server.slaves list by the master. If slave->replstate == REDIS_REPL_WAIT_BGSAVE_START, the dump that just completed is not the one this slave needs, and Redis must start a new background dump. If slave->replstate == REDIS_REPL_WAIT_BGSAVE_END, the completed dump is exactly what this slave needs, and sendBulkToSlave is installed as the handler of the slave's write event.</p>
<pre class="wp-code-highlight prettyprint">
static void updateSlavesWaitingBgsave(int bgsaveerr) {
    listNode *ln;
    int startbgsave = 0;
    listIter li;

    listRewind(server.slaves,&amp;li);
    while((ln = listNext(&amp;li))) {
        redisClient *slave = ln-&gt;value;

        if (slave-&gt;replstate == REDIS_REPL_WAIT_BGSAVE_START) {
            startbgsave = 1;
            slave-&gt;replstate = REDIS_REPL_WAIT_BGSAVE_END;
        } else if (slave-&gt;replstate == REDIS_REPL_WAIT_BGSAVE_END) {
            struct redis_stat buf;

            if (bgsaveerr != REDIS_OK) {
                freeClient(slave);
                redisLog(REDIS_WARNING,&quot;SYNC failed. BGSAVE child returned an error&quot;);
                continue;
            }
            if ((slave-&gt;repldbfd = open(server.dbfilename,O_RDONLY)) == -1 ||
                redis_fstat(slave-&gt;repldbfd,&amp;buf) == -1) {
                freeClient(slave);
                redisLog(REDIS_WARNING,&quot;SYNC failed. Can't open/stat DB after BGSAVE: %s&quot;, strerror(errno));
                continue;
            }
            slave-&gt;repldboff = 0;
            slave-&gt;repldbsize = buf.st_size;
            slave-&gt;replstate = REDIS_REPL_SEND_BULK;
            aeDeleteFileEvent(server.el,slave-&gt;fd,AE_WRITABLE);
            if (aeCreateFileEvent(server.el, slave-&gt;fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
                freeClient(slave);
                continue;
            }
        }
    }
    if (startbgsave) {
        if (rdbSaveBackground(server.dbfilename) != REDIS_OK) {
            listIter li;

            listRewind(server.slaves,&amp;li);
            redisLog(REDIS_WARNING,&quot;SYNC failed. BGSAVE failed&quot;);
            while((ln = listNext(&amp;li))) {
                redisClient *slave = ln-&gt;value;

                if (slave-&gt;replstate == REDIS_REPL_WAIT_BGSAVE_START)
                    freeClient(slave);
            }
        }
    }
}
</pre>
<p>The logic of sendBulkToSlave is not complicated. Through slave->repldbfd it reads the db data from the dumped rdb file and sends it. When the transfer completes, it removes the write event and sets slave->replstate to REDIS_REPL_ONLINE; from then on the master, after executing a command via the call function, uses replicationFeedSlaves to propagate the update to the slave. replicationFeedSlaves likewise iterates over the slave list and sends the current command and its arguments to every slave in the REDIS_REPL_ONLINE state.</p>
<pre class="wp-code-highlight prettyprint">
 static void sendBulkToSlave(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *slave = privdata;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);
    char buf[REDIS_IOBUF_LEN];
    ssize_t nwritten, buflen;

    if (slave-&gt;repldboff == 0) {
        /* Write the bulk write count before to transfer the DB. In theory here
         * we don't know how much room there is in the output buffer of the
         * socket, but in practice SO_SNDLOWAT (the minimum count for output
         * operations) will never be smaller than the few bytes we need. */
        sds bulkcount;

        bulkcount = sdscatprintf(sdsempty(),&quot;$%lld\r\n&quot;,(unsigned long long)
            slave-&gt;repldbsize);
        if (write(fd,bulkcount,sdslen(bulkcount)) != (signed)sdslen(bulkcount))
        {
            sdsfree(bulkcount);
            freeClient(slave);
            return;
        }
        sdsfree(bulkcount);
    }
    lseek(slave-&gt;repldbfd,slave-&gt;repldboff,SEEK_SET);
    buflen = read(slave-&gt;repldbfd,buf,REDIS_IOBUF_LEN);
    if (buflen &lt;= 0) {
        redisLog(REDIS_WARNING,&quot;Read error sending DB to slave: %s&quot;,
            (buflen == 0) ? &quot;premature EOF&quot; : strerror(errno));
        freeClient(slave);
        return;
    }
    if ((nwritten = write(fd,buf,buflen)) == -1) {
        redisLog(REDIS_VERBOSE,&quot;Write error sending DB to slave: %s&quot;,
            strerror(errno));
        freeClient(slave);
        return;
    }
    slave-&gt;repldboff += nwritten;
    if (slave-&gt;repldboff == slave-&gt;repldbsize) {
        close(slave-&gt;repldbfd);
        slave-&gt;repldbfd = -1;
        aeDeleteFileEvent(server.el,slave-&gt;fd,AE_WRITABLE);
        slave-&gt;replstate = REDIS_REPL_ONLINE;
        if (aeCreateFileEvent(server.el, slave-&gt;fd, AE_WRITABLE,
            sendReplyToClient, slave) == AE_ERR) {
            freeClient(slave);
            return;
        }
        addReplySds(slave,sdsempty());
        redisLog(REDIS_NOTICE,&quot;Synchronization with slave succeeded&quot;);
    }
}
</pre>
<p>Next, let us see how Redis runs as a slave.</p>
<p>When Redis acts as a slave (an ordinary client can also play the slave role, in which case the details depend on that client's implementation), the master's address must be given in the configuration file. While loadServerConfig reads the configuration, server.replstate is set to REDIS_REPL_CONNECT. Redis in this state only gets to call syncWithMaster, and thereby perform the initial synchronization with the master, when serverCron runs. Reading syncWithMaster shows that it, too, issues the sync command to establish the master-slave relationship; moreover, it sends and receives data with the syncWrite and syncRead functions, which are blocking. Therefore, when Redis runs as a slave, establishing the initial master-slave relationship is blocking as well.</p>
<pre class="wp-code-highlight prettyprint">
 /* Check if we should connect to a MASTER */
    if (server.replstate == REDIS_REPL_CONNECT &amp;&amp; !(loops % 10)) {
        redisLog(REDIS_NOTICE,&quot;Connecting to MASTER...&quot;);
        if (syncWithMaster() == REDIS_OK) {
            redisLog(REDIS_NOTICE,&quot;MASTER &lt;-&gt; SLAVE sync succeeded&quot;);
            if (server.appendonly) rewriteAppendOnlyFileBackground();
        }
    }
</pre>
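<p>The blocking send used during the initial synchronization can be sketched as a simple write loop. This is an illustrative stand-in for syncWrite, which in the real code additionally enforces a timeout around each write.</p>

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Keep calling write(2) until every byte of buf has been sent, blocking
 * as needed. Returns the number of bytes sent, or -1 on error. */
ssize_t sync_write_all(int fd, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n <= 0) return -1;   /* error; the caller tears down the link */
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```

Because the loop never returns until the whole buffer is on the wire (or an error occurs), a slave using such primitives is necessarily blocked for the duration of the initial synchronization.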
<p>The other command related to master-slave replication is slaveof. It switches Redis between master and slave roles; as the preceding analysis shows, this only requires changing a few state fields.</p>
<pre class="wp-code-highlight prettyprint">
static void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c-&gt;argv[1]-&gt;ptr,&quot;no&quot;) &amp;&amp;
        !strcasecmp(c-&gt;argv[2]-&gt;ptr,&quot;one&quot;)) {
        if (server.masterhost) {
            sdsfree(server.masterhost);
            server.masterhost = NULL;
            if (server.master) freeClient(server.master);
            server.replstate = REDIS_REPL_NONE;
            redisLog(REDIS_NOTICE,&quot;MASTER MODE enabled (user request)&quot;);
        }
    } else {
        sdsfree(server.masterhost);
        server.masterhost = sdsdup(c-&gt;argv[1]-&gt;ptr);
        server.masterport = atoi(c-&gt;argv[2]-&gt;ptr);
        if (server.master) freeClient(server.master);
        server.replstate = REDIS_REPL_CONNECT;
        redisLog(REDIS_NOTICE,&quot;SLAVE OF %s:%d enabled (user request)&quot;,
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/108.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 18 – AOF Persistence</title>
		<link>http://www.petermao.com/redis/106.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=106</link>
		<comments>http://www.petermao.com/redis/106.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:49:18 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=57</guid>
		<description><![CDATA[The purpose of Redis's aof feature is to offer a better trade-off between performance and persistence granularity. Snapshot persistence is controlled by elapsed time (seconds) and number of changed keys; a fine granularity hurts performance badly, since every save dumps the whole db, while a coarse one cannot guarantee durability of the data changed within the window. AOF persistence instead works at the granularity of each command that modifies the db, the finest possible, much like a log; since only the one command is recorded, it is also the cheapest. As with a log, the aof file keeps growing, so it can be rebuilt in the background with BGREWRITEAOF. Let us first look at how redis records commands. call is the command-execution function (covered in the command-processing chapters). If a command modifies data, server.dirty changes across the call; with aof enabled, call invokes feedAppendOnlyFile to record the command and its arguments. static void call(redisClient *c, struct redisCommand *cmd){ long long dirty; dirty = server.dirty; cmd-&#62;proc(c); dirty = server.dirty-dirty; if(server.appendonly &#38;&#38; dirty) feedAppendOnlyFile(cmd,c-&#62;db-&#62;id,c-&#62;argv,c-&#62;argc); --- } feedAppendOnlyFile first checks whether the current command runs against the same db as the previous one; if not, it emits a select command to switch dbs, then performs some command translation (code omitted). Next, the buffer holding the command and its arguments is appended to server.aofbuf, which accumulates the commands executed over a period of time and is flushed to the on-disk aof file at an appropriate moment; if a background rewrite is in progress, the buffer is also appended to server.bgrewritebuf, which holds the commands executed while the rewrite child runs, to be appended to the rebuilt file when the child exits. static void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc){ &#8230; <a href="http://www.petermao.com/redis/106.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The purpose of Redis's aof feature is to offer a better trade-off between performance and persistence granularity.</p>
<p>Snapshot persistence is controlled by two kinds of thresholds: elapsed time (in seconds) and the number of changed keys. If the granularity is fine, performance suffers badly, because every save dumps the entire db; if it is coarse, durability of the data changed within the configured window cannot be guaranteed. AOF persistence, by contrast, works at the granularity of each command that modifies the db, the finest granularity possible, much like a log; since only the one command is recorded, it is also the cheapest. And, as with a log, the aof file grows over time, so it can be rebuilt in the background with the BGREWRITEAOF command.</p>
<p>Let us first look at how redis records commands.</p>
<p>The call function executes commands (covered in detail in the command-processing chapters). If a command modifies data, the value of server.dirty changes across the call. With aof enabled, call then invokes feedAppendOnlyFile to record the command and its arguments.<span id="more-57"></span></p>
<pre class="wp-code-highlight prettyprint">
static void call(redisClient *c, struct redisCommand *cmd){
   long long dirty;
   dirty = server.dirty;
   cmd-&gt;proc(c);
   dirty = server.dirty-dirty;
   if(server.appendonly &amp;&amp; dirty)
       feedAppendOnlyFile(cmd,c-&gt;db-&gt;id,c-&gt;argv,c-&gt;argc);
   ---
}
</pre>
<p>feedAppendOnlyFile first checks whether the db the current command runs against is the same one the previous command used. If not, it must first emit a select command to switch dbs, and then performs some command translation (code omitted).</p>
<p>Next, the buffer holding the command and its arguments is appended to server.aofbuf, which accumulates the commands redis has executed over a period of time; redis flushes it to the on-disk aof file at an appropriate moment. Then, if a background aof rewrite is in progress, the same buffer is also appended to server.bgrewritebuf, which holds the commands executed while the rewrite child runs; when the child exits, these commands must be appended to the rebuilt file.</p>
<pre class="wp-code-highlight prettyprint">
static void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc){
   ---
   server.aofbuf = sdscatlen(server.aofbuf,buf,sdslen(buf));
   ---
   if(server.bgrewritechildpid != -1)
       server.bgrewritebuf = sdscatlen(server.bgrewritebuf,buf,sdslen(buf));
   sdsfree(buf);
}
</pre>
<p>Now let us see when server.aofbuf is flushed to the aof file on disk.</p>
<p>The flush is done by flushAppendOnlyFile, which is called from beforeSleep (introduced in the event-handling chapter). beforeSleep runs before client events are processed (the event loop aeMain calls beforeSleep first and then aeProcessEvents), so the contents of server.aofbuf reach the disk before replies are sent to clients.</p>
<p>flushAppendOnlyFile writes the whole server.aofbuf buffer with a single write call and, depending on the configured sync policy, calls aof_fsync (a macro wrapping the system fsync; fdatasync on Linux) to sync it, so the new commands and their arguments end up appended to the aof file.</p>
<pre class="wp-code-highlight prettyprint">
static void flushAppendOnlyFile(void){
   time_t now;
   ssize_t nwritten;
   ---
    nwritten = write(server.appendfd,server.aofbuf,sdslen(server.aofbuf));
   ---
   sdsfree(server.aofbuf);
   server.aofbuf = sdsempty();

   /* Fsync if needed */
   now = time(NULL);
   if(server.appendfsync == APPENDFSYNC_ALWAYS||
        (server.appendfsync == APPENDFSYNC_EVERYSEC &amp;&amp;
        now-server.lastfsync &gt; 1))
   {
       /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
       aof_fsync(server.appendfd);/* Let's try to get this data on the disk */
       server.lastfsync = now;
   }
}
</pre>
<p>Next we look at how the aof file is rebuilt in the background.</p>
<p>The rewrite is done by rewriteAppendOnlyFileBackground. Its call sites show that it runs when a bgrewriteaof command is received; when a config command enables the aof mechanism where it was previously off; and, when redis runs as a slave, after the connection to the master is established and syncWithMaster completes in serverCron.<br />
The main logic of rewriteAppendOnlyFileBackground is as follows (code omitted):</p>
<p>1) fork a child process;</p>
<p>2) the child calls rewriteAppendOnlyFile to write, into a temporary file, the data and commands that reproduce the current state of the db,</p>
<p>     while the parent appends every command executed during this period that modifies the db to server.bgrewritebuf (see the discussion of feedAppendOnlyFile above);</p>
<p>3) when the child exits, the parent is notified, flushes the in-memory buffer above into the temporary file, and then renames the temporary file into place as the new aof file (backgroundRewriteDoneHandler).</p>
<p>The parent waits, in serverCron, for the child performing the aof rewrite or the snapshot save, as follows:</p>
<pre class="wp-code-highlight prettyprint">
/* Check if a background saving or AOF rewrite in progress terminated */
  if(server.bgsavechildpid != -1||server.bgrewritechildpid != -1){
      int statloc;
      pid_t pid;

      if((pid = wait3(&amp;statloc,WNOHANG,NULL))!= 0){
          if(pid == server.bgsavechildpid){
              backgroundSaveDoneHandler(statloc);
          } else {
              backgroundRewriteDoneHandler(statloc);
          }
          updateDictResizePolicy();
      }
  }
</pre>
<p>rewriteAppendOnlyFile writes, into a temporary file, the commands and arguments that reproduce the current state of the db. The function iterates over every entry in the db; a redis db is really one big hash table, with each entry represented as a (key, val) pair. The type of val can be determined from the entry (redis supports five data types: REDIS_STRING, REDIS_LIST, REDIS_SET, REDIS_ZSET and REDIS_HASH), and the data in val is then decoded. Entries are written in the same form a client would use to issue a command. For REDIS_STRING, it first writes "*3\r\n$3\r\nSET\r\n", then the key to set, then the val. For REDIS_LIST, after casting val to a list, it writes "*3\r\n$5\r\nRPUSH\r\n", then the name of the list, then the first list element, repeating these three steps until the list is exhausted. For REDIS_SET it prefixes each element with "*3\r\n$4\r\nSADD\r\n"; for REDIS_ZSET, each entry with "*4\r\n$4\r\nZADD\r\n"; for REDIS_HASH, each entry with "*4\r\n$4\r\nHSET\r\n" (the code is simple but tedious, so it is omitted).</p>
<p>Finally, we cover how redis rebuilds the db from the aof file at startup.</p>
<p>The key to the startup replay is building a fake client and using it to feed the server the commands read back from the aof file.</p>
<pre class="wp-code-highlight prettyprint">
int loadAppendOnlyFile(char *filename){
   ---
   fakeClient = createFakeClient();
   while(1){
       ---
       if(fgets(buf,sizeof(buf),fp)== NULL){
          ---
       }
      // parse buf into the command and its arguments
      // look up the command
       cmd = lookupCommand(argv[0]-&gt;ptr);
      ---

      // execute the command
       cmd-&gt;proc(fakeClient);
     ---
   }
   ---
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/106.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 17 – Snapshot Persistence</title>
		<link>http://www.petermao.com/redis/104.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=104</link>
		<comments>http://www.petermao.com/redis/104.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:45:26 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=54</guid>
		<description><![CDATA[redis persistence supports snapshotting, which dumps the entire db to disk. A client can issue the save/bgsave commands to have the server dump the db; bgsave performs the dump in the background (a forked child does it), while save is a blocking dump that delays other clients' commands. Besides these commands, redis's serverCron also performs background dumps according to the configured thresholds, and when a slave connects, the master runs a background dump before sending data to the slave (covered in the replication chapter). static int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) { --- /* Check if a background saving or AOF rewrite in progress terminated */ if (server.bgsavechildpid != -1 &#124;&#124; server.bgrewritechildpid != -1) &#8230; <a href="http://www.petermao.com/redis/104.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>redis persistence supports snapshotting. A snapshot dumps the entire db to disk.</p>
<p>A client can issue the save/bgsave commands to have the server dump the db to disk. bgsave performs the dump in the background (a child process is forked to do it), while save is a blocking dump that delays other clients' commands. Besides these explicit commands, redis's serverCron also performs background dumps according to the configured thresholds, and when a slave connects, the master runs a background dump before sending data to the slave (covered in the replication chapter).<span id="more-54"></span></p>
<pre class="wp-code-highlight prettyprint">
static int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    ---
/* Check if a background saving or AOF rewrite in progress terminated */
    if (server.bgsavechildpid != -1 || server.bgrewritechildpid != -1) {
      ---
    }
else {
        /* If there is not a background saving in progress check if
         * we have to save now */
         time_t now = time(NULL);
         for (j = 0; j &lt; server.saveparamslen; j++) {
            struct saveparam *sp = server.saveparams+j;

            if (server.dirty &gt;= sp-&gt;changes &amp;&amp;
                now-server.lastsave &gt; sp-&gt;seconds) {
                redisLog(REDIS_NOTICE,&quot;%d changes in %d seconds. Saving...&quot;,
                    sp-&gt;changes, sp-&gt;seconds);
                rdbSaveBackground(server.dbfilename);
                break;
            }
         }
}
---
}
</pre>
<p>Whether the snapshot is taken in a child process or in blocking fashion (the child-process path goes through rdbSaveBackground first), it ultimately calls rdbSave to save the db.</p>
<p>rdbSave shows that redis stores each entry in the db as a type, key, value record.</p>
<p>rdbLoad, the inverse of rdbSave, loads the data at server startup after a snapshot save.</p>
<pre class="wp-code-highlight prettyprint">
static int rdbSave(char *filename) {

    ---
    for (j = 0; j &lt; server.dbnum; j++) {
        redisDb *db = server.db+j;
       ---
        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {
            robj *key = dictGetEntryKey(de);
            robj *o = dictGetEntryVal(de);
            time_t expiretime = getExpire(db,key);
            ---
            /* Save type, key, value */
           if (rdbSaveType(fp,o-&gt;type) == -1) goto werr;
          if (rdbSaveStringObject(fp,key) == -1) goto werr;
          if (rdbSaveObject(fp,o) == -1) goto werr;
          ---
        }
        dictReleaseIterator(di);
    }
    ---
    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        redisLog(REDIS_WARNING,&quot;Error moving temp DB file on the final destination: %s&quot;, strerror(errno));
        unlink(tmpfile);
        return REDIS_ERR;
    }
    ---
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/104.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Redis Source Code Analysis 16 – Blocking Commands</title>
		<link>http://www.petermao.com/redis/102.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=102</link>
		<comments>http://www.petermao.com/redis/102.html#comments</comments>
		<pubDate>Sun, 01 May 2011 10:43:28 +0000</pubDate>
		<dc:creator>petermao</dc:creator>
				<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://www.petermao.com/?p=52</guid>
		<description><![CDATA[redis currently supports blocking operations only on lists, via the brpop and blpop commands. When the list has elements they behave like an ordinary pop: one element is popped and returned. When the list is empty, the redisClient is given the REDIS_BLOCKED flag and blocks (such a client stays blocked; see the command-processing chapter) until a new element is pushed (pushGenericCommand, the handler for push operations). Their handlers, brpopCommand and blpopCommand, both call blockingPopGenericCommand, which pops via the non-blocking popGenericCommand when the list has elements and otherwise calls blockForKeys to handle the blocking case. /* Blocking RPOP/LPOP */ static void blockingPopGenericCommand(redisClient *c, int where) { robj *o; long long lltimeout; time_t timeout; int j; /* Make sure timeout is an integer value */ if (getLongLongFromObjectOrReply(c,c-&#62;argv[c-&#62;argc-1],&#38;lltimeout, &#34;timeout is not an integer&#34;) != &#8230; <a href="http://www.petermao.com/redis/102.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>redis currently supports blocking operations only on lists, via two commands: brpop and blpop.</p>
<p>When the list has elements, these commands behave exactly like an ordinary pop: one element is popped from the list and returned. When the list is empty, the redisClient is given the REDIS_BLOCKED flag and the client blocks (a redisClient with REDIS_BLOCKED stays blocked; see the command-processing chapter) until a new element is pushed (in pushGenericCommand, the handler for push operations).</p>
<p>The handlers registered for these commands, brpopCommand and blpopCommand, both call blockingPopGenericCommand. If it finds elements in the list, that function falls back to the non-blocking popGenericCommand to pop one; otherwise it calls blockForKeys to handle the blocking case.<span id="more-52"></span></p>
<pre class="wp-code-highlight prettyprint">
/* Blocking RPOP/LPOP */
static void blockingPopGenericCommand(redisClient *c, int where) {
    robj *o;
    long long lltimeout;
    time_t timeout;
    int j;

    /* Make sure timeout is an integer value */
    if (getLongLongFromObjectOrReply(c,c-&gt;argv[c-&gt;argc-1],&amp;lltimeout,
            &quot;timeout is not an integer&quot;) != REDIS_OK) return;

    /* Make sure the timeout is not negative */
    if (lltimeout &lt; 0) {
        addReplySds(c,sdsnew(&quot;-ERR timeout is negative\r\n&quot;));
        return;
    }

    for (j = 1; j &lt; c-&gt;argc-1; j++) {
        o = lookupKeyWrite(c-&gt;db,c-&gt;argv[j]);
        if (o != NULL) {
            if (o-&gt;type != REDIS_LIST) {
                addReply(c,shared.wrongtypeerr);
                return;
            } else {
                list *list = o-&gt;ptr;
                if (listLength(list) != 0) {
                    /* If the list contains elements fall back to the usual
                     * non-blocking POP operation */
                    robj *argv[2], **orig_argv;
                    int orig_argc;

                    /* We need to alter the command arguments before to call
                     * popGenericCommand() as the command takes a single key. */
                    orig_argv = c-&gt;argv;
                    orig_argc = c-&gt;argc;
                    argv[1] = c-&gt;argv[j];
                    c-&gt;argv = argv;
                    c-&gt;argc = 2;

                    /* Also the return value is different, we need to output
                     * the multi bulk reply header and the key name. The
                     * &quot;real&quot; command will add the last element (the value)
                     * for us. If this sounds like a hack to you it's just
                     * because it is... */
                    addReplySds(c,sdsnew(&quot;*2\r\n&quot;));
                    addReplyBulk(c,argv[1]);
                    popGenericCommand(c,where);

                    /* Fix the client structure with the original stuff */
                    c-&gt;argv = orig_argv;
                    c-&gt;argc = orig_argc;
                    return;
                }
            }
        }
    }

    /* If we are inside a MULTI/EXEC and the list is empty the only thing
     * we can do is treating it as a timeout (even with timeout 0). */
    if (c-&gt;flags &amp; REDIS_MULTI) {
        addReply(c,shared.nullmultibulk);
        return;
    }

    /* If the list is empty or the key does not exists we must block */
    timeout = lltimeout;
    if (timeout &gt; 0) timeout += time(NULL);
    blockForKeys(c,c-&gt;argv+1,c-&gt;argc-2,timeout);
}
</pre>
<p>blockForKeys records the mapping between the client and the keys it waits on in db->blockingkeys, then sets the REDIS_BLOCKED flag on the client, so the client stays blocked from then on.</p>
<pre class="wp-code-highlight prettyprint">
static void blockForKeys(redisClient *c, robj **keys, int numkeys, time_t timeout) {
    dictEntry *de;
    list *l;
    int j;
    ---
    if (c-&gt;fd &lt; 0) return;

    c-&gt;blockingkeys = zmalloc(sizeof(robj*)*numkeys);
    c-&gt;blockingkeysnum = numkeys;
    c-&gt;blockingto = timeout;
    for (j = 0; j &lt; numkeys; j++) {
        /* Add the key in the client structure, to map clients -&gt; keys */
        c-&gt;blockingkeys[j] = keys[j];
        incrRefCount(keys[j]);

        /* And in the other &quot;side&quot;, to map keys -&gt; clients */
        de = dictFind(c-&gt;db-&gt;blockingkeys,keys[j]);
        if (de == NULL) {
            int retval;

            /* For every key we take a list of clients blocked for it */
            l = listCreate();
            retval = dictAdd(c-&gt;db-&gt;blockingkeys,keys[j],l);
            incrRefCount(keys[j]);
            assert(retval == DICT_OK);
        } else {
            l = dictGetEntryVal(de);
        }
        listAddNodeTail(l,c);
    }
    /* Mark the client as a blocked client */
    c-&gt;flags |= REDIS_BLOCKED;
    server.blpop_blocked_clients++;
}
</pre>
<p>A waiting client stays blocked until a push operation occurs, at which point unblockClientWaitingData is called to unblock it.</p>
<pre class="wp-code-highlight prettyprint">
/* Unblock a client that's waiting in a blocking operation such as BLPOP */
// drop the references the client holds on the keys it was blocked on
static void unblockClientWaitingData(redisClient *c) {
    dictEntry *de;
    list *l;
    int j;

    assert(c-&gt;blockingkeys != NULL);
    /* The client may wait for multiple keys, so unblock it for every key. */
    for (j = 0; j &lt; c-&gt;blockingkeysnum; j++) {
        /* Remove this client from the list of clients waiting for this key. */
        de = dictFind(c-&gt;db-&gt;blockingkeys,c-&gt;blockingkeys[j]);
        assert(de != NULL);
        l = dictGetEntryVal(de);
        listDelNode(l,listSearchKey(l,c));
        /* If the list is empty we need to remove it to avoid wasting memory */
        if (listLength(l) == 0)
            dictDelete(c-&gt;db-&gt;blockingkeys,c-&gt;blockingkeys[j]);
        decrRefCount(c-&gt;blockingkeys[j]);
    }
    /* Cleanup the client structure */
    zfree(c-&gt;blockingkeys);
    c-&gt;blockingkeys = NULL;
    c-&gt;flags &amp;= (~REDIS_BLOCKED);
    server.blpop_blocked_clients--;
    /* We want to process data if there is some command waiting
     * in the input buffer. Note that this is safe even if
     * unblockClientWaitingData() gets called from freeClient() because
     * freeClient() will be smart enough to call this function
     * *after* c-&gt;querybuf was set to NULL. */
    if (c-&gt;querybuf &amp;&amp; sdslen(c-&gt;querybuf) &gt; 0) processInputBuffer(c);
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.petermao.com/redis/102.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
