System Architectures for Speech
Techland Group, www.techland.com.uk
1 Introduction
It’s certainly no secret that the last few years have brought some dramatic changes to the high-tech landscape. The harsh economic reality tested many business formulas, technologies and products. Unfortunately, many of the once widely hyped ideas did not survive and all of us have learned an important lesson: at the end of the day, the only successful products/solutions are those that make money. In other words, it’s all about cost, price and, above all, Return on Investment (ROI).
So, how can we build a compelling ROI case for speech technology? Of course, the return depends on the business side of your application/project. Because this is a technical paper, we’ll only briefly touch on this aspect. But the investment is mostly about technology: it’s the cost of the building blocks that you use and the engineering and tuning hours that you spend. We’ll take a closer look at this aspect, because the technical decisions made early in a project can have a dramatic impact on the final cost. No matter how much engineers hate this, more often than not, it is the cost that kills great product ideas.
With the above in mind, this paper will review different technologies, architectures and implementation strategies for building speech applications, with a special focus on costs versus functionality.
Throughout this paper we’ll refer to the example of the Airport Assistant, a real-life speech-enabled application implemented by one of our partners. Airport Assistant allows callers to select by name one of several hundred airport services, request real-time flight information and be notified (at a cell number they provide) about any changes in the schedule. Airport Assistant has been successfully deployed at the biggest airport in Canada, Toronto’s Pearson International.
2 Why isn’t speech ubiquitous?
First, let’s put things in perspective by asking another question: how much is speech used today? According to Datamonitor, the total supply-side market for “voice business technologies and services” (meaning all systems employing TTS and ASR) is growing from $629 million in 2001 to $4.3 billion in 2007. That’s a compound annual growth rate (CAGR) of 38%. It looks impressive, but only until you realize a key point: these are total numbers that must be further broken down into verticals. For example, if you are in the healthcare and pharmaceuticals business, your share is only 2% (again, a Datamonitor number), which for 2001 translated to $12.5 million worldwide. Obviously, this represents low penetration by any stretch of the imagination.
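The growth rate and the vertical share quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are ours, and the inputs are the Datamonitor figures as quoted in this paper.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.38 means 38%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Datamonitor figures quoted above, in millions of US dollars.
market_2001 = 629.0
market_2007 = 4300.0

growth = cagr(market_2001, market_2007, years=2007 - 2001)
print(f"CAGR 2001-2007: {growth:.1%}")

# Healthcare and pharmaceuticals: a 2% share of the 2001 total.
healthcare_2001 = 0.02 * market_2001
print(f"Healthcare and pharmaceuticals, 2001: ${healthcare_2001:.1f} million")
```

The result is consistent with the rounded 38% and $12.5 million figures in the text.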
So why, despite almost twenty years of continuous improvements in technology, are speech applications still so slow to sell? Most analysts agree that the technology itself is “ready for prime time” and that customers are ready to accept its clear benefits. Where is the problem? Many different reasons have been cited, from numerous false starts of immature technology, to bad design, to cultural aversion against “talking to a machine”. They all are true, but in our opinion, a major remaining barrier to wider adoption is cost. The remainder of this section discusses the main factors contributing to the high cost of speech applications today.
2.1 Cost of Complexity
Despite many years of great technological progress, building voice applications is still a complex task. It typically involves integration of multiple core technologies: ASR, TTS, speaker verification, speech object libraries, telephony hardware, call processing, web services and databases, all tied together with thousands of lines of custom code. Although all vendors strive to position themselves as one-stop shops, none is a leader in all areas, which forces system architects to pick and choose best-of-breed components from multiple suppliers. A classic example is multilingual TTS – no vendor has the best voice quality in all languages.
Of course, components coming from different vendors are not designed to work easily together, which makes implementation difficult and prolongs the learning curve. It also requires a team with a diverse and sophisticated skill set: telephony hardware, protocols, real-time programming and Web development, to name a few. Even today, experienced developers and voice system designers are not cheap and their salaries quickly add to the cost of a project.
2.2 Cost of Sophisticated but Imperfect Technology
When compared to touchtone, there is no doubt that speech technology allows for a much more effective and satisfying user experience. However, those benefits come at a price: building a good speech application requires a lot more effort invested in the human-machine interface (HMI). Touchtone based user interfaces were easy to design: the user input was always limited to one out of ten digits and the DTMF detectors guaranteed almost 100% accuracy.
With speech and natural language, the situation is vastly different. Not only is recognition never 100% accurate but also the user input is not restricted in any way. This puts a lot of extra burden on the application design and dramatically increases its complexity – it is not uncommon that more programming logic is dedicated to compensating for errors and failures (low accuracy, ambiguity, out-of-grammar vocabulary, and so on) than to the actual business features. As a result, designing a good speech interface is a difficult art requiring the interdisciplinary talents of computer science, linguistics and human factors psychology.
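To make the point about error-handling logic concrete, here is a minimal sketch of the decision a speech dialog has to make after every recognition attempt. Everything in it is an assumption for illustration: the `RecognitionResult` type, the confidence thresholds and the retry limit are hypothetical and do not come from any particular ASR engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    text: Optional[str]   # None when the utterance matched nothing in the grammar
    confidence: float     # 0.0 to 1.0, as reported by a hypothetical engine


# Hypothetical tuning parameters; real values come out of field tuning.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.50
MAX_ATTEMPTS = 3


def next_action(result: RecognitionResult, attempt: int) -> str:
    """Decide what the dialog does after one recognition attempt."""
    if result.text is not None and result.confidence >= ACCEPT_THRESHOLD:
        return "accept"      # confident match: carry on with the business logic
    if result.text is not None and result.confidence >= CONFIRM_THRESHOLD:
        return "confirm"     # ambiguous: ask "Did you say ...?"
    if attempt < MAX_ATTEMPTS:
        return "reprompt"    # low confidence or out-of-grammar: ask again
    return "fallback"        # give up on speech: touchtone menu or operator


print(next_action(RecognitionResult("baggage services", 0.91), attempt=1))
print(next_action(RecognitionResult("baggage services", 0.62), attempt=1))
print(next_action(RecognitionResult(None, 0.0), attempt=3))
```

A touchtone menu needs none of these branches, which is one reason why the speech version of the same application carries so much more code.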
2.3 Cost of Core Speech Technology (Licenses)
Today, the core speech technology is offered by a handful of vendors who have spent heavily on development of their products and are now trying (rightly so) to recoup and capitalize on their investment. Consequently, TTS and ASR licenses continue to be very expensive.
As an example, the ASR licenses (no TTS was used) for the Toronto Airport Assistant application, described in the introduction, originally accounted for almost 50% of the total system cost. The remaining 50% paid for everything else, including redundant hardware, development tools, database server, UPS and more. The cost of ASR licenses was later reduced to less than 23%, by better license management that took advantage of the specific call patterns. This approach is discussed in more detail in the following sections.
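It can help to translate these percentages into license spend. The sketch below assumes, purely for illustration, that the non-license costs stayed fixed while only the license spend changed; the paper does not say so, and the 50% and 23% figures are themselves rounded ("almost" and "less than").

```python
def license_share(license_cost: float, other_cost: float) -> float:
    """Fraction of the total system cost that goes to licenses."""
    return license_cost / (license_cost + other_cost)


def license_cost_for_share(target_share: float, other_cost: float) -> float:
    """Solve share = L / (L + other) for the license cost L."""
    return target_share * other_cost / (1.0 - target_share)


OTHER_COST = 100.0          # arbitrary units, assumed fixed (our assumption)
original_licenses = 100.0   # equal to everything else, i.e. a 50% share

reduced_licenses = license_cost_for_share(0.23, OTHER_COST)
reduction = 1.0 - reduced_licenses / original_licenses

print(f"Original share: {license_share(original_licenses, OTHER_COST):.0%}")
print(f"License spend at a 23% share: {reduced_licenses:.1f} units")
print(f"Implied cut in license spend: {reduction:.0%}")
```

Under that assumption, moving from a 50% share to a 23% share means cutting the license spend itself by roughly 70%, a much larger change than the difference between the two percentages suggests.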
With such high licensing costs, it is unfortunate that many commercial speech platforms seem to completely ignore the issue and continue to use licenses very ineffectively. It is not uncommon that one application port could require two or more ASR licenses, especially if multiple languages or “always active hot-words” are involved. Obviously, doubling or tripling the licensing cost has a dramatic impact on the end user price of the finished application.
2.4 Cost of Tuning
The high cost of speech applications doesn’t end with the first installation. Even the best-designed systems require a lot of tuning before they can be put into full production. Furthermore, some applications (such as dial-by-name auto-attendants needing constant addition or deletion of subscribers to their grammars) require ongoing tuning throughout their life cycle. Tuning typically requires the attention of a computational linguist, who is still a rare and therefore hard-to-find professional. This is in sharp contrast to the traditional touchtone systems, which, once tested by a software QA team, would run virtually maintenance-free in production.
3 Architecting for Lower Cost
We’ve identified cost as one of the main barriers to the wider adoption of speech applications, in particular in mid-market environments. High up-front costs result in marginal ROI stories, and it doesn’t matter how elegant or efficient the application is if no one buys it. Thus, it is important to design for the best cost-functionality balance.
Today, system architects and developers of speech-based telephony applications face many difficult choices regarding platforms, tools, speech technologies, and so on. The abundance of new standards only adds to the overall confusion. Of course, no single system architecture could meet all possible requirements, but at the same time, selecting the right architecture has fundamental impact, especially on cost. This section offers a number of specific recommendations based on our real-life experiences.
3.1 Simplify the Complex
Modern speech applications are very rarely (if ever) built from scratch. Instead, multiple ready-to-go building blocks are put together including: ASR and TTS engines, object libraries, call processing frameworks, development tools. As a result, building a modern speech application has become a matter of integration. This has certainly simplified the task, but it hasn’t made it easy – integration comes with problems of its own.
Typically, speech building blocks (cards or speech engines) come with native APIs, that is, low-level interfaces requiring programming in C++. But writing a speech application in C++ is not for the faint of heart. It may sound like a fun challenge to developers, but certainly not to a project manager who is responsible for budgeting and scheduling. Low-level details (such as call state machines, resource management, multi-threading, voice buffering, ActiveX, COM, and sockets) will very quickly distract developers from solving the actual business problem at hand.
Learning APIs and telephony abstraction models from different vendors will dramatically extend the learning curve. Don’t count on the telephony standards for help: unfortunately, despite multiple attempts (such as TAPI, SAPI, TSAPI, and S.100), there is no universal standard API for low-level telephony functionality. Adoption by vendors is random at best, and interoperability of particular implementations can never be taken for granted.
Can this complexity of native APIs be avoided? Absolutely - by leveraging the work done by others. In practice, this means using one of the high-level Rapid Application Development (RAD) tools. RAD tools hide low-level complexity and abstract the mishmash of multi-vendor components into a uniform development environment, focused on the business logic, not technology. Some RAD tools reduce the learning curve even more by leveraging the power of one of the industry-standard Integrated Development Environments (IDE), for example Microsoft Visual Studio, and popular development languages, such as Visual Basic or Java.
Properly selected development tools can save a lot of money. On the Toronto Airport Assistant project, both the developers and project manager claimed that the switch to an appropriate tool cut the delivery time by more than 50%.
Of course, not all RAD tools are created equal and they should be carefully selected for a specific project and its development team. Ideally, you should look for a tool that combines a high-level visual design environment with a flexible programming environment.
Visual Design – Most RAD tools use drag-and-drop GUI interfaces, which increase programmer productivity and enhance structure and readability of the source code. However, when selecting a tool, consider inherent limitations of visual design and programming. Ready-to-use building blocks work very well as long as all functionality is available.
Some applications may benefit from the simplicity of this approach; however, most practical applications require functionality that goes beyond ready-to-go functions. Sooner or later, you’ll need to customize blocks or integrate them with external third-party components. In other words, choose tools that combine the productivity gains of visual programming with a powerful programming environment.
Programming Environment – Building speech applications today is all about integration and customization. Any serious application requires some custom programming. The right development and debugging tools can save your project when you least expect it. Make absolutely sure that your tools support a serious, industry-standard programming language, source-level debugging, seamless invocation of component libraries (such as DCOM, ActiveX, and CORBA) and control over multithreading. If you’re building an application on Windows, don’t miss out on the benefits of the next-generation technology from Microsoft – your tool must support .NET!
As a typical example, one of our customers built a large outbound dialing application that depended on answering machine detection. The initial implementation used the original detection algorithm embedded into Dialogic cards. Unfortunately, statistical accuracy (which depends highly on the target calling area) couldn’t be verified before field trials. When the first results came in, accuracy levels were around 80%, significantly below expectations. Because their RAD tool was fully integrated into Microsoft Visual Studio, the programmers were able to devise a custom solution that boosted accuracy to 96%. This would have been impossible without a flexible programming environment.
3.2 Break Up into Modules
The ability to break your speech application into cooperating modules is a must. Not only does it improve scalability, reliability and performance of your system, but it also saves you money in both development and production.
A modular system is cheaper to build and maintain. In development, programmers benefit from working in parallel on well-defined modules. In production, independent module provisioning and software hot-swaps eliminate costly system downtimes. At the same time, separating application logic from telephony and speech processing allows resource sharing, which in turn leads to more efficient utilization. Finally, distributing your modules across a LAN enables load balancing and effortless scalability – again resulting in savings on system maintenance.
The biggest benefit, however, comes from increased reliability of a modular system. Nothing is more frustrating to callers than a system that crashes into “dead silence” in the middle of a transaction. An unreliable system will soon be pulled out of production, which always means significant financial losses.
A monolithic executable is only as reliable as its weakest component, while a modular system can stay operational even after losing one of its modules. Therefore, it is very important that application modules execute properly separated from each other and from the system processes, so that a fatal error in one doesn’t bring down the whole system. The modules should run out-of-process, or even better, distributed across a LAN. Ideally, modules should be compiled directly into stand-alone executables, not into intermediate scripts or p-code. Not only does this speed up program execution, it also removes the dependency on a shared runtime engine as a single point of failure.
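The isolation argument can be illustrated with a small supervisor that runs each module in its own operating-system process, so that a crash in one module neither takes down the supervisor nor affects the other modules. This is a deliberately simplified sketch: the module names and commands are hypothetical stand-ins, the modules run one after another and exit immediately, and a real system would supervise long-running processes in parallel.

```python
import subprocess
import sys

# Hypothetical modules, each a stand-alone command. The second one simulates
# a module that always crashes by exiting with a non-zero status.
MODULES = {
    "flight_info": [sys.executable, "-c", "print('flight info ready')"],
    "notifier": [sys.executable, "-c", "import sys; sys.exit(1)"],
}


def run_module(command: list, max_restarts: int = 2) -> int:
    """Run one module out-of-process, restarting it after a crash.

    Returns the number of failed runs (0 means it succeeded the first time).
    """
    failures = 0
    for _ in range(max_restarts + 1):
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode == 0:
            break
        failures += 1
    return failures


def supervise(modules: dict) -> dict:
    """Run every module; a failure in one does not stop the others."""
    return {name: run_module(command) for name, command in modules.items()}


report = supervise(MODULES)
for name, failures in report.items():
    print(f"{name}: {failures} failed run(s)")
```

The crashing module exhausts its restarts while the healthy module and the supervisor are unaffected, which is the behavior a monolithic executable cannot offer.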
3.3 Maintain Vendor Independence
From a cost perspective, vendor-independence comes into play with respect to speech resources and telephony hardware.
Speech Resources – Selecting the right speech components for your application is very important. Speech technology is still complex and very expensive, but the quality and accuracy of the chosen engines could ultimately make the difference between the success and failure of your project.
In practice, when it comes to speech processing, the “one size fits all” approach does not work. This is true for ASR, but even more so for TTS - not all engines and languages are equal. Therefore, it pays to carefully research and evaluate different vendors before deciding on a TTS product. Remember that perception of quality is subjective and may depend on your audience and application. As a result, you may end up working with more than one vendor at a time and changing vendors as your application evolves.
Unfortunately, this may require re-doing the integration work many times, resulting in additional cost. If you’re lucky, the engine of your choice may support a standard API (like SAPI), but not all do. And even if it does support SAPI, these interfaces are often far from perfect. Fine variations in timing, buffering schemes and performance can result in irritating gaps, clicks and delays. Usually, better results are achieved through individually crafted, native APIs, but this again means additional development costs.
The best way to achieve vendor-independence is to use a middleware abstraction layer, which in turn works with a number of alternate engines. Again, a RAD tool is appropriate: it will protect your investment in application development should a shift in requirements, technology or vendor strategies necessitate the move to another engine. It will also allow you to experiment with multiple speech products to find the best price-quality balance for your application.
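In code, such an abstraction layer usually comes down to one interface that the application programs against, with a thin adapter per engine behind it. The sketch below assumes hypothetical `VendorAEngine` and `VendorBEngine` adapters that return placeholder bytes; they stand in for wrappers around real native APIs and do not represent any actual product.

```python
from abc import ABC, abstractmethod


class TtsEngine(ABC):
    """The only TTS interface the application code is allowed to see."""

    @abstractmethod
    def synthesize(self, text: str, language: str) -> bytes:
        """Return audio for the given text."""


class VendorAEngine(TtsEngine):
    # Hypothetical adapter; a real one would call the vendor's native API.
    def synthesize(self, text: str, language: str) -> bytes:
        return f"[vendor-a:{language}] {text}".encode("utf-8")


class VendorBEngine(TtsEngine):
    # A second hypothetical adapter, for example one with a better French voice.
    def synthesize(self, text: str, language: str) -> bytes:
        return f"[vendor-b:{language}] {text}".encode("utf-8")


class TtsRouter(TtsEngine):
    """Picks an engine per language, so vendors can be mixed or swapped."""

    def __init__(self, default: TtsEngine) -> None:
        self._default = default
        self._by_language = {}

    def register(self, language: str, engine: TtsEngine) -> None:
        self._by_language[language] = engine

    def synthesize(self, text: str, language: str) -> bytes:
        engine = self._by_language.get(language, self._default)
        return engine.synthesize(text, language)


tts = TtsRouter(default=VendorAEngine())
tts.register("fr", VendorBEngine())

print(tts.synthesize("Your flight is on time.", "en"))
print(tts.synthesize("Votre vol est à l'heure.", "fr"))
```

Changing the vendor for one language is then a single `register` call rather than a rewrite of the application, which is the kind of protection the middleware is meant to provide.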
Telephony Hardware – Similar to speech resources, vendor-independence of telephony hardware can save you money. It is not uncommon for speech applications to run on more than one brand of hardware or to switch vendors for better pricing. In general, the smaller telephony hardware suppliers tend to be less expensive and much more accommodating when it comes to technical support plans, and if you’re new to computer telephony you will definitely require support.
On the other hand, smaller vendors may not support all cards and protocols. Support for the most popular ones (analog, T1/E1, ISDN-PRI, H.323, R2-CAS and so on) is a must. However, it’s also worth paying attention to the less obvious capabilities, such as SIP, PBX set-emulator cards and transfer-on-CO protocols like TBCT, RLT and Q.SIG. Again, make sure that your middleware framework allows experimenting with different cards and protocols from multiple vendors, and keep in mind that your requirements may change with time.
PBX Integration – Most analysts predict that call centers will account for the biggest slice of the “speech market pie” in the coming years. If your organization is engaged in the call-center market, you know that developing applications for just one PBX-brand is not enough. Given that speech is so much more expensive than touchtone, PBX vendor-independence is becoming more important than ever. Again, make sure that your development tools or middleware offer a good PBX integration story.
What does this mean in practical terms? Today, many PBX vendors are changing their traditional proprietary architectures and have begun opening up to third-party applications. Since PBXs are built for the enterprise (with its strong Microsoft presence), this trend is most visible in Windows, where in recent years we’ve observed a renewed interest in TAPI as the integration technology of choice. As a result, you may expect a complete and well-tested TAPI Service Provider for almost any switch, especially for modern IP-PBXs. In our opinion, building your speech application on TAPI is the best strategy for widening your customer base and consequently maximizing your ROI in the call-center market.
3.4 Conserve your Resources
As noted earlier, ASR and TTS licenses are the most expensive, yet also the most misused resources in speech applications. Some commercially available platforms use two or more ASR licenses per application port, particularly in multilingual or hot-word applications. Below we present a few practical guidelines for saving money:
Royalty Free Engines – Yes, they are available! Companies like Microsoft and Aculab offer license-free ASR and TTS technologies of high quality that may be perfect for your application. One word of caution: customers may accept a lower quality TTS (as long as the message is understandable), but they have much less tolerance for imperfect ASR. From our experience, speech recognition has to work close to perfectly, or it will be deemed useless and dropped. In other words, carefully evaluate your ASR alternatives.
As with the other elements of your application, maintaining vendor-independence works to your advantage, allowing you, for example, to experiment with free engines before deciding to spend your dollars on licenses. (As a side note, none of the speech vendors that we know of accepts returns of purchased licenses). Again, picking a middleware framework that supports both free and commercial engines is, in our opinion, the best strategy.
One license per channel - If you decide to use a commercial speech engine from one of the industry leaders, invest some time to properly engineer your license manager – a lot of money can be saved by this effort. There is no technical reason to use more than one engine license (TTS or ASR) per application channel. Even systems using multiple languages or parallel grammars to implement hot-words can be designed to use one license per channel at any given time. Make sure that your middleware doesn’t force you to unnecessarily double your resources.
Floating licenses – You should also keep in mind that many applications don’t require ASR and TTS for the whole duration of a call. As an example, consider a pre-paid calling-card system. It uses speech recognition to identify callers, checks account balances and then bridges calls to outbound trunks. In this scenario, speech resources (ASR and TTS) are only required for a small fraction of a call, possibly as low as 10%. Once a call is bridged, the resources can be redeployed to serve other channels, which presents a great savings opportunity. If licenses could float dynamically between channels, in theory the savings could be as high as 90%. Unfortunately, not many platforms allow floating speech resources, but some systems do. Given the savings, it pays to ensure that the tool you choose allows for proper license management to take advantage of the specific calling patterns in your application.
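Both guidelines, one license per channel and licenses that float between channels, can be expressed as a small shared pool. The following is a minimal sketch under our own assumptions: the `LicensePool` class and the channel identifiers are hypothetical, and a real license manager would talk to the speech vendor's licensing mechanism instead of counting in memory.

```python
import threading


class LicensePool:
    """A fixed number of engine licenses shared by all channels."""

    def __init__(self, size: int) -> None:
        self._free = size
        self._holders = set()
        self._lock = threading.Lock()

    def acquire(self, channel_id: int) -> bool:
        """Give the channel a license, but never more than one at a time."""
        with self._lock:
            if channel_id in self._holders:
                return True          # the channel reuses the license it holds
            if self._free == 0:
                return False         # caller must wait or fall back to touchtone
            self._free -= 1
            self._holders.add(channel_id)
            return True

    def release(self, channel_id: int) -> None:
        """Return the license, for example as soon as the call is bridged."""
        with self._lock:
            if channel_id in self._holders:
                self._holders.remove(channel_id)
                self._free += 1

    def in_use(self) -> int:
        with self._lock:
            return len(self._holders)


# Three channels share two licenses.
pool = LicensePool(size=2)
print(pool.acquire(1))   # channel 1 starts its speech dialog
print(pool.acquire(2))   # channel 2 starts its speech dialog
print(pool.acquire(3))   # channel 3 finds no free license yet
pool.release(1)          # channel 1 is bridged to an outbound trunk
print(pool.acquire(3))   # channel 3 now picks up the freed license
```

Note that the 90% figure in the text is a theoretical bound based on average usage. Sizing a real pool has to allow for peaks, when many callers are in the speech portion of their calls at the same moment.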
3.5 Don’t Skimp on Tuning!
Any non-trivial speech application requires extensive testing and tuning, much more than a traditional touchtone system. This aspect is new, and often comes as a surprise to designers coming from an old IVR background.
Tuning is much more than just tweaking grammars. It is an iterative process of analyzing system performance and repeatedly applying the best design practices in order to arrive at the most satisfying user experience and in order to work around technology imperfections. As a result, the tuning phase can take many months and requires an interdisciplinary team of professionals, including not only developers and testers but also experts in linguistics and often psychology. The resulting cost is substantial, but in our experience this is money well spent.
Planning for tuning is difficult, because speech systems, unlike touchtone, are highly dependent on the demographics, local accents, language mix and even the culture of the target audience. Some applications, such as speech-enabled auto-attendants, may require regular, on-going tuning as the grammar (i.e. a list of employee names) changes over time. Typically, end users are not able and should not be expected to perform tuning themselves and the application should be specifically designed for on-going, remote maintenance by the vendor.
Tuning large grammars, such as a city’s phone book, tends to be particularly challenging and should be approached with special caution. The experienced speech technology providers seem to be well aware of the possible problems: one highly recognized vendor would sell us a ready-to-use grammar containing tens of thousands of names, but would not venture into signing a contract to get it working in the field.
Unfortunately, saving on tuning is not easy and may jeopardize the final quality of your product. Some savings may be achieved by employing off-the-shelf speech component libraries such as Nuance Speech Objects or SpeechWorks Dialog Modules. But in general, tuning is not the area in which to be penny-pinching. In speech recognition applications, users typically have very little patience for shaky technology. The system has to be almost perfect or it will not be used. There is no middle ground.
3.6 Select Platform to Fit Your Application
Over the last few years, we’ve observed two promising trends impacting telephony and speech applications: open source operating systems (mainly Linux) and XML-based scripting languages (mainly VoiceXML and SALT). But before you bet your budget, take a careful look at the cost – the bottom-line ROI of your application is the criterion of success.
Operating System – The choice of operating system fundamentally impacts many aspects of a speech application. Today, there are three main choices: Unix (mainly Solaris), Linux or Windows. While discussing the merits of each OS is beyond the scope of this paper, we will discuss some important considerations specific to telephony and speech.
First, keep your target market in mind. The old bias against Windows still holds strong in some traditional telephony environments, especially among carriers in North America. Even recently, we’ve seen an already completed application being ported to Solaris after approaching carriers with a Windows version. However, other regions of the world regard Windows much more favorably. Even in North America, the situation is much different in the enterprise, where Windows naturally fits into the desktop and business back-end dominated by Microsoft.
The most widely quoted complaints against Windows are reliability and price. We believe that this continued bias is no longer justified – the modern Windows is reliable enough for speech applications, both for carriers and enterprise. As for the price, Windows compares favorably to Unix, and while Linux is free, the price of the OS alone is often almost negligible, especially for systems deployed in small numbers. For example, in the case of the Toronto Airport Assistant (48 lines, Nuance), the price of the Windows operating system was not even 1% of the total cost.
Therefore, the price of the operating system is secondary to the availability of strong development tools, component libraries and middleware. The resulting increase in developer productivity has the potential to far outweigh the savings on the purchase price of the operating system.
Open Standards for Speech – Recent years have brought multiple exciting initiatives to standardize the development of speech applications, including the well-known VoiceXML, SALT, X+V, and CCXML. A discussion of their respective technical merits is beyond the scope of this paper, but there is a wealth of relevant information available from many sources. Similarly, we will not attempt to speculate which standard will ultimately prevail in the future. Instead, we will point out a few less obvious aspects that may impact your immediate strategy today. We will focus on VoiceXML, as it is the only standard in deployment today. The fundamental question is: will your project benefit from using the standard or would you be better off with a proprietary system? Unfortunately, the answer is not always straightforward. Below we present a few ideas to consider.
First, try to articulate the exact benefits of VoiceXML for your particular application. Next, look at the cost of your respective choices. Request quotes for equivalent VoiceXML and proprietary platforms and analyze them carefully. You will most likely find VoiceXML environments to be substantially more expensive. We received a quote for a typical 24-port VoiceXML Gateway (including hardware, but without ASR or TTS licenses) at $1265 per port. You can build an equivalent system using one of the popular proprietary RAD tools at half this cost.
Debugging your application is another important consideration. We don’t know of any VoiceXML Gateway that offers an IDE supporting the complete environment. Even for VoiceXML alone, the development tools are in short supply. Furthermore, most tools are merely GUI overlays on top of VoiceXML syntax, good only for creating static pages (as opposed to dynamic pages, which are generated on-the-fly from database queries and program logic). Therefore, before committing to a VoiceXML gateway, ask about source-level debugging, handling of call state machines, multithreading of application code, accessing databases and other basic programming tasks. Our point is that today’s VoiceXML development environment is still primitive when compared to the industry-standard IDEs, like Microsoft Visual Studio.
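When comparing quotes, it is worth reducing them to the same basis before drawing conclusions. The sketch below simply works through the numbers quoted above; the variable names are ours, "half this cost" is taken literally, and ASR and TTS licenses are excluded on both sides, as in the quote.

```python
PORTS = 24
VOICEXML_PRICE_PER_PORT = 1265.0   # quoted price, hardware included, no ASR/TTS licenses
PROPRIETARY_PRICE_PER_PORT = VOICEXML_PRICE_PER_PORT / 2   # "half this cost", taken literally


def platform_cost(ports: int, price_per_port: float) -> float:
    """Total platform cost before speech licenses."""
    return ports * price_per_port


voicexml_total = platform_cost(PORTS, VOICEXML_PRICE_PER_PORT)
proprietary_total = platform_cost(PORTS, PROPRIETARY_PRICE_PER_PORT)

print(f"VoiceXML gateway:  ${voicexml_total:,.2f}")
print(f"Proprietary RAD:   ${proprietary_total:,.2f}")
print(f"Difference:        ${voicexml_total - proprietary_total:,.2f}")
```

Speech licenses come on top of either figure, so the platform difference should be weighed against the total system cost rather than viewed in isolation.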
Finally, take a look at the planned functionality of your application. VoiceXML, by definition, is limited by its own specification. VoiceXML has been designed specifically for speech-based user dialogs and that’s where it excels. If your application is about call control, then beware: your only hope will be proprietary extensions, which in turn ties you to a specific platform, and negates the benefits of vendor-independence and application portability.
So, are we advocating ignoring open standards? No, on the contrary, we strongly believe in the value of open standards and their future wide acceptance. VoiceXML will steadily gain popularity, especially once CCXML addresses the current shortcomings in call control, once the platform prices are reduced and once better developer tools emerge. Similarly, SALT offers a great future promise because of its tight integration with the Microsoft environment (including rich development tools). Until this happens, however, a proper RAD tool can get the job done much quicker and cheaper.
Our recommendation: don’t be afraid of using proprietary environments if they are a better fit for your application and especially when you can realize significant savings. However, make sure that your tools have a well-defined migration path, should the open-standard market develop for your application in the future.
In other words, ensure that your tools either support VoiceXML or are properly integrated with VoiceXML products. You could even consider a hybrid solution to combine the best of both worlds. For example, a front-end node built on a proprietary system could do the heavy-duty call processing and call a VoiceXML gateway to execute best-of-breed third-party VoiceXML components and applications.
3.7 Fallback to Touchtone
Yes indeed, this is the last resort. For the record, we truly believe in the superiority of speech recognition. But don’t discard the old touchtone just yet: at the end of the day, reverting to touchtone may cut the cost enough to get your budget approved. Whether we like it or not, many of our customers today opt to save through touchtone, and the fact remains that some applications won’t benefit significantly from speech recognition.
One possible cost-saving strategy is again a hybrid solution, where speech recognition is applied selectively, to the areas that bring the most benefit to the application. For example, an auto-attendant and voice mail system may be speech-enabled on the customer-facing side and touchtone for the employees.
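A hybrid design can be captured as a per-dialog setting, which also makes it easy to see how many ports actually need an ASR license. The dialog names and port counts below are hypothetical and serve only to illustrate the idea.

```python
from enum import Enum


class InputMode(Enum):
    SPEECH = "speech"
    TOUCHTONE = "touchtone"


# Hypothetical configuration: dialog name -> (input mode, number of ports).
DIALOGS = {
    "customer_auto_attendant": (InputMode.SPEECH, 8),
    "employee_voice_mail": (InputMode.TOUCHTONE, 16),
}


def asr_ports(dialogs: dict) -> int:
    """Count only the ports whose dialog actually uses speech recognition."""
    return sum(ports for mode, ports in dialogs.values() if mode is InputMode.SPEECH)


total_ports = sum(ports for _, ports in DIALOGS.values())
print(f"Ports needing ASR: {asr_ports(DIALOGS)} of {total_ports}")
```

In this made-up configuration only a third of the ports require speech licenses, while every caller still reaches the same application.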
Make sure that the tools and middleware that you buy are as good for traditional IVR as they are for sophisticated speech applications. Touchtone technology is not going away any time soon.
4 Conclusion
Speech today is ready for prime time. Thanks to reliable, accurate and commercially available speech engines, many compelling applications became possible, and many have been implemented already. At the same time, we continue to see customers walking away from great speech applications and settling for the old-style touchtone solutions. The primary culprit is cost - in our view the most important factor barring speech from wider acceptance. Unfortunately, the cost will stay high as long as speech remains limited to a niche market. Our industry has yet to come up with a creative way to get out of this impasse.
Nevertheless, we believe that speech applications present opportunities for cost savings, even with today’s high-priced licenses and platforms. This paper has presented a number of practical guidelines for lowering the cost of speech by properly selecting tools and technologies. We hope that applying these guidelines will help you to build a better business case for speech on your next project.
Glossary
API - Application Programming Interface
ASR - Automatic Speech Recognition
BRI - Basic Rate Interface
CAGR - Compound Annual Growth Rate
CO - Central Office
GUI - Graphical User Interface
HMI - Human-Machine Interface
IDE - Integrated Development Environment
ISDN - Integrated Services Digital Network
OS - Operating System
PBX - Private Branch Exchange
PRI - Primary Rate Interface
RAD - Rapid Application Development
RLT - Release Link Transfer
ROI - Return on Investment
TAPI - Telephony Application Programming Interface
TBCT - Two B-Channel Transfer
TTS - Text to Speech
VUI - Voice User Interface