我们在前面介绍Actor系统时说过每个Actor都是其子Actor的管理员,并且每个Actor定义了发生错误时的管理策略,策略一旦定义好,之后不能修改,就像是Actor系统不可分割的一部分。
实用错误处理
首先我们来看一个例子来显示一种处理数据存储错误的情况,这是现实中一个应用可能出现的典型错误。当然实际的应用可能针对数据源不存在时有不同的处理,这里我们使用重新连接的处理方法。
下面是例子的源码,比较长,需要仔细阅读,最好是实际运行,参考日志来理解:
import akka.actor._
import akka.actor.SupervisorStrategy._
import scala.concurrent.duration._
import akka.util.Timeout
import akka.event.LoggingReceive
import akka.pattern.{ask, pipe}
import com.typesafe.config.ConfigFactory
/**
* Runs the sample
*/
object FaultHandlingDocSample extends App {
import Worker._
val config = ConfigFactory.parseString( """
akka.loglevel = "DEBUG"
akka.actor.debug {
receive = on
lifecycle = on
}
""")
val system = ActorSystem("FaultToleranceSample", config)
val worker = system.actorOf(Props[Worker], name = "worker")
val listener = system.actorOf(Props[Listener], name = "listener")
// start the work and listen on progress
// note that the listener is used as sender of the tell,
// i.e. it will receive replies from the worker
worker.tell(Start, sender = listener)
}
/**
* Listens on progress from the worker and shuts down the system when enough
* work has been done.
*/
class Listener extends Actor with ActorLogging {
import Worker._
// If we don’t get any progress within 15 seconds then the service is unavailable
context.setReceiveTimeout(15 seconds)
def receive = {
case Progress(percent) =>
log.info("Current progress: {} %", percent)
if (percent >= 100.0) {
log.info("That’s all, shutting down")
context.system.shutdown()
}
case ReceiveTimeout =>
// No progress within 15 seconds, ServiceUnavailable
log.error("Shutting down due to unavailable service")
context.system.shutdown()
}
}
object Worker {
case object Start
case object Do
final case class Progress(percent: Double)
}
/**
* Worker performs some work when it receives the ‘Start‘ message.
* It will continuously notify the sender of the ‘Start‘ message
* of current ‘‘Progress‘‘. The ‘Worker‘ supervise the ‘CounterService‘.
*/
class Worker extends Actor with ActorLogging {
import Worker._
import CounterService._
implicit val askTimeout = Timeout(5 seconds)
// Stop the CounterService child if it throws ServiceUnavailable
override val supervisorStrategy = OneForOneStrategy() {
case _: CounterService.ServiceUnavailable => Stop
}
// The sender of the initial Start message will continuously be notified
// about progress
var progressListener: Option[ActorRef] = None
val counterService = context.actorOf(Props[CounterService], name = "counter")
val totalCount = 51
import context.dispatcher
// Use this Actors’ Dispatcher as ExecutionContext
def receive = LoggingReceive {
case Start if progressListener.isEmpty =>
progressListener = Some(sender())
context.system.scheduler.schedule(Duration.Zero, 1 second, self, Do)
case Do =>
counterService ! Increment(1)
counterService ! Increment(1)
counterService ! Increment(1)
// Send current progress to the initial sender
counterService ? GetCurrentCount map {
case CurrentCount(_, count) => Progress(100.0 * count / totalCount)
} pipeTo progressListener.get
}
}
object CounterService {
final case class Increment(n: Int)
case object GetCurrentCount
final case class CurrentCount(key: String, count: Long)
class ServiceUnavailable(msg: String) extends RuntimeException(msg)
private case object Reconnect
}
/**
* Adds the value received in ‘Increment‘ message to a persistent
* counter. Replies with ‘CurrentCount‘ when it is asked for ‘CurrentCount‘.
* ‘CounterService‘ supervise ‘Storage‘ and ‘Counter‘.
*/
class CounterService extends Actor {
import CounterService._
import Counter._
import Storage._
// Restart the storage child when StorageException is thrown.
// After 3 restarts within 5 seconds it will be stopped.
override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 3,
withinTimeRange = 5 seconds) {
case _: Storage.StorageException => Restart
}
val key = self.path.name
var storage: Option[ActorRef] = None
var counter: Option[ActorRef] = None
var backlog = IndexedSeq.empty[(ActorRef, Any)]
val MaxBacklog = 10000
import context.dispatcher
// Use this Actors’ Dispatcher as ExecutionContext
override def preStart() {
initStorage()
}
/**
* The child storage is restarted in case of failure, but after 3 restarts,
* and still failing it will be stopped. Better to back-off than continuously
* failing. When it has been stopped we will schedule a Reconnect after a delay.
* Watch the child so we receive Terminated message when it has been terminated.
*/
def initStorage() {
storage = Some(context.watch(context.actorOf(Props[Storage], name = "storage")))
// Tell the counter, if any, to use the new storage
counter foreach {
_ ! UseStorage(storage)
}
// We need the initial value to be able to operate
storage.get ! Get(key)
}
def receive = LoggingReceive {
case Entry(k, v) if k == key && counter == None =>
// Reply from Storage of the initial value, now we can create the Counter
val c = context.actorOf(Props(classOf[Counter], key, v))
counter = Some(c)
// Tell the counter to use current storage
c ! UseStorage(storage)
// and send the buffered backlog to the counter
for ((replyTo, msg) <- backlog) c.tell(msg, sender = replyTo)
backlog = IndexedSeq.empty
case msg@Increment(n) => forwardOrPlaceInBacklog(msg)
case msg@GetCurrentCount => forwardOrPlaceInBacklog(msg)
case Terminated(actorRef) if Some(actorRef) == storage =>
// After 3 restarts the storage child is stopped.
// We receive Terminated because we watch the child, see initStorage.
storage = None
// Tell the counter that there is no storage for the moment
counter foreach {
_ ! UseStorage(None)
}
// Try to re-establish storage after while
context.system.scheduler.scheduleOnce(10 seconds, self, Reconnect)
case Reconnect =>
// Re-establish storage after the scheduled delay
initStorage()
}
def forwardOrPlaceInBacklog(msg: Any) {
// We need the initial value from storage before we can start delegate to
// the counter. Before that we place the messages in a backlog, to be sent
// to the counter when it is initialized.
counter match {
case Some(c) => c forward msg
case None =>
if (backlog.size >= MaxBacklog)
throw new ServiceUnavailable(
"CounterService not available, lack of initial value")
backlog :+= (sender() -> msg)
}
}
}
object Counter {
final case class UseStorage(storage: Option[ActorRef])
}
/**
* The in memory count variable that will send current
* value to the ‘Storage‘, if there is any storage
* available at the moment.
*/
class Counter(key: String, initialValue: Long) extends Actor {
import Counter._
import CounterService._
import Storage._
var count = initialValue
var storage: Option[ActorRef] = None
def receive = LoggingReceive {
case UseStorage(s) =>
storage = s
storeCount()
case Increment(n) =>
count += n
storeCount()
case GetCurrentCount =>
sender() ! CurrentCount(key, count)
}
def storeCount() {
// Delegate dangerous work, to protect our valuable state.
// We can continue without storage.
storage foreach {
_ ! Store(Entry(key, count))
}
}
}
object DummyDB {
import Storage.StorageException
private var db = Map[String, Long]()
@throws(classOf[StorageException])
def save(key: String, value: Long): Unit = synchronized {
if (11 <= value && value <= 14)
throw new StorageException("Simulated store failure " + value)
db += (key -> value)
}
@throws(classOf[StorageException])
def load(key: String): Option[Long] = synchronized {
db.get(key)
}
}
object Storage {
final case class Store(entry: Entry)
final case class Get(key: String)
final case class Entry(key: String, value: Long)
class StorageException(msg: String) extends RuntimeException(msg)
}
/**
* Saves key/value pairs to persistent storage when receiving ‘Store‘ message.
* Replies with current value when receiving ‘Get‘ message.
* Will throw StorageException if the underlying data store is out of order.
*/
class Storage extends Actor {
import Storage._
val db = DummyDB
def receive = LoggingReceive {
case Store(Entry(key, count)) => db.save(key, count)
case Get(key) => sender() ! Entry(key, db.load(key).getOrElse(0L))
}
}
这个例子定义了五个Actor,分别是Worker, Listener, CounterService ,Counter 和 Storage,下图给出了系统正常运行时的流程(无错误发生的情况):

其中Worker是CounterService的父Actor(管理员),CounterService是Counter和Storage的父Actor(管理员)图中浅红色,白色代表引用,其中Worker引用了Listener,Listener也引用了Worker,它们之间不存在父子关系,同样Counter也引用了Storage,但Counter不是Storage的管理员。
正常流程如下:
步骤 |
描述 |
1 |
progress Listener 通知Worker开始工作. |
2 |
Worker通过定时发送Do消息给自己来完成工作 |
3,4,5 |
Worker接受到Do消息时,通知其子Actor CounterService 三次递增计数器,
CounterService 将Increment消息转发给Counter,它将递增计数器变量然后把当前值发送给Storeage保存 |
6,7 |
Workier询问CounterService 当前计数器的值,然后通过管道把结果传给Listener |
下图给出系统出错的情况,例子中Worker和CounterService作为管理员分别定义了两个管理策略,Worker在收到CounterService 的ServiceUnaviable上终止CounterService的运行,而CounterService在收到StorageException时重启Storage。

出错时的流程
步骤 |
描述 |
1 |
Storage抛出StorageException异常 |
2 |
Storage的管理员CounterService根据策略在接受到StorageException异常后重启Storage |
3,4,5,6 |
Storage继续出错并重启 |
7 |
如果在5秒钟之内Storage出错三次并重启,其管理员(CounterService)就终止Storage运行 |
8 |
CounterService 同时监听Storage的Terminated消息,它在Storeage终止后接受到Terminated消息 |
9,10,11 |
并且通知Counter 暂时没有Storage |
12 |
CounterService 延时一段时间给自己发生Reconnect消息 |
13,14 |
当它收到Reconnect消息时,重新创建一个Storage |
15,16 |
然后通知Counter使用新的Storage |
这里给出运行的一个日志供参考。