Computers & Security

Editor-in-Chief: Dr Eugene Schultz, CISSP, Chief Technology Officer, High Tower Software, 26970 Aliso Viejo Pathway, Aliso Viejo, CA 92656, USA. Email: [email protected]
Editor: Nova Dudley-Gough, Elsevier E.A.T, 18.139 Radarweg 29, Amsterdam 1043 NX, Netherlands. Email: [email protected]

Academic Editor: Prof. Eugene Spafford, Professor and Director, Purdue University CERIAS, Department of Computer Science, 1398 Computer Science Building, Purdue University, West Lafayette, IN 47907-1398, USA. Email: [email protected]

IFIP TC-11 Editor: Prof. Dr Dimitris Gritzalis, Dept. of Informatics, Athens University of Economics and Business, 76 Patission Street, Athens GR-104 34, Greece. Email: [email protected]
Editorial Board

Charles Cresson Wood Independent Information Security Consultant and Author Email: [email protected]
August Bequai Attorney At Law, McLean, Va. Email: [email protected]
Dr Richard Ford Associate Professor Florida Institute of Technology Email: rford@fit.edu
Sarah Gordon Senior Research Fellow, Symantec Security Response Email: [email protected]
Professor William J (Bill) Caelli Head — School of Software Engineering and Data Communications, Queensland University of Technology Email: [email protected]
Leon A M Strous Senior IT-Auditor at the Payment Systems Policy Department, De Nederlandsche Bank Email: [email protected]
Stephen Hinde Group Information Protection Manager, BUPANet Email: [email protected]
Prof. Zhenfu Cao Department of Computer Science and Engineering Shanghai Jiao Tong University Email: [email protected]
Publisher David Clark
Marketing Ursula Culligan
Editorial Administrator Vicky Barker
PUBLISHED 8 ISSUES PER YEAR

Orders, claims, and journal enquiries: please contact the Customer Service Department at the Regional Sales office nearest you:

Orlando: Elsevier, Customer Service Department, 6277 Sea Harbor Drive, Orlando, FL 32887-4800, USA; phone: (877) 8397126 or (800) 6542452 [toll-free numbers for US customers]; (+1) (407) 3454020 or (+1) (407) 3454000 [customers outside US]; fax: (+1) (407) 3631354 or (+1) (407) 3639661; e-mail: [email protected] or [email protected]

Amsterdam: Elsevier, Customer Service Department, PO Box 211, 1000 AE Amsterdam, The Netherlands; phone: (+31) (20) 4853757; fax: (+31) (20) 4853432; e-mail: [email protected]

Tokyo: Elsevier, Customer Service Department, 4F Higashi-Azabu, 1-Chome Bldg, 1-9-15 Higashi-Azabu, Minato-ku, Tokyo 106-0044, Japan; phone: (+81) (3) 5561 5037; fax: (+81) (3) 5561 5047; e-mail: [email protected]

Singapore: Elsevier, Customer Service Department, 3 Killiney Road, #08-01 Winsland House I, Singapore 239519; phone: (+65) 63490222; fax: (+65) 67331510; e-mail: [email protected]

© 2006 Elsevier Ltd.
www.elsevier.com/locate/cose
Number 8
November 2006
Contents

Predicting the future of InfoSec
E. E. Schultz 553

Security views 555

Tightening the net: A review of current and next generation spam filtering tools
J. Carpinter and R. Hunt 566

Expected benefits of information security investments
J. J. C. H. Ryan and D. J. Ryan 579

A virtual disk environment for providing file system recovery
J. Liang and X. Guan 589

Wavelet based Denial-of-Service detection
G. Carl, R. R. Brooks and S. Rai 600
From the Editor-in-Chief
Predicting the future of InfoSec
I recently participated in a project conducted by the SANS Institute in which a group of infosec professionals was asked to predict infosec trends of the future. Interestingly but not surprisingly, the prediction ranked first (i.e., most likely to occur) was that laptop encryption will become mandatory within US government agencies and other organizations that store personal and financial data on computing systems, and that it will be built into new computers. Another, less highly ranked prediction was that a new worm or virus will infect thousands of computers sometime in the future, humorous evidence that the "P.T. Barnum effect" is still very much alive and well.

Although numerous predictions about the kinds of information security-related events that will occur in the future continually surface, what about the future of information security itself? Donn Parker, a true infosec pioneer, once characterized the practice of information security as "a folk art" in contrast to other better-established, more systematic professional disciplines. Will infosec ever break out of whatever shackles have held it back for so long? The answer is almost certainly yes; in fact, this has already been happening over the last few years. Organizations' infosec-related spending has, according to available statistics, generally continued to grow year by year, something that has not been true of the IT arena as a whole. Senior management is slowly but surely coming to appreciate the value of infosec in protecting organizations' information assets. I will go on record as predicting that this trend will continue, a sign of growing acceptance and success.

Other factors not directly related to the practice of infosec per se are, however, also likely to propel infosec increasingly towards success. In particular, the continued growth of crime, not just computer-related crime, is likely to contribute to infosec's growth because so much evidence concerning criminal acts resides in computers. A tax evader's computer, for example, is likely to contain incriminating information that law enforcement agencies would dearly love to have. Computer forensics specialists are best prepared to discover and preserve such information. Computer forensics is thus likely to grow by leaps and bounds in the future, thereby
bolstering the credibility and leverage of the practice of information security as a whole.

Additionally, unauthorized keystroke sniffers currently abound, enabling perpetrators to steal passwords, Social Security numbers, credit card numbers, personal identification numbers, and more. The unfortunate aftermath has been a rapid proliferation of identity theft. In the future, however, I predict that keystroke sniffers will be used for even more sordid purposes: perpetrators will increasingly use them in extortion attempts. Individuals involved in extramarital affairs, for example, will become potential victims in extortion plots, because perpetrators will capture the keystrokes they enter when they correspond with their extramarital partners and then attempt to get those whose keystrokes were recorded to pay to avoid having the contents of their messages exposed. Encrypting email traffic will in such cases do nothing to protect them from extortionists, because the keystrokes are captured before encryption occurs, even though encrypting email is, all things considered, an excellent security practice.

Regulatory and compliance considerations have already bolstered information security considerably, but I expect these considerations to become considerably more numerous and intense in the future. The reason is that more regulatory and compliance legislation requiring particular infosec practices within organizations is likely to be passed in an attempt to stem the tide of software, movie and music piracy, data security breaches leading to identity theft, and the like. This will prove even more helpful to the growth of information security.

Finally, the practice of infosec has increasingly been centered on security risk management. Risk management in general has grown substantially in the business world over the last five to ten years, and infosec professionals have been able to fill in pieces of risk management puzzles that other professionals have not. Given the extreme importance of risks related to information resources, infosec professionals are likely to hold an exceptionally attractive set of cards. Once again, the future of the practice of infosec appears to be very promising.
The opinions in this editorial are entirely those of the author, not of High Tower Software. They do not in any way represent High Tower Software’s position on the issues that are addressed.
In short, infosec has increasingly proven its value to senior management and stakeholders, and additional factors and considerations are only likely to boost the practice of infosec. There is thus genuine cause for growing optimism concerning the future of infosec.
Dr. E. Eugene Schultz, CISSP, CISM
E-mail address: [email protected]

0167-4048/$ – see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cose.2006.10.010
Security views
1. Malware update
A virus-worm hybrid, Worm.Mocbot.a, has spread prolifically in Shanghai. Exploiting a vulnerability in Windows XP to infect systems, this malware uses a chat network to gain control of victim systems and then gleans passwords and financial data. Infected systems often become so unstable that they may not even be able to connect to the Internet. Shanghai's anti-virus support center has responded to more than 800 calls for help. A warning message reading "Generic Host Process for Win32 Services," with many words printed in Chinese characters, indicates that Worm.Mocbot.a has infected a computer.

For a brief period Samsung Electronics' US Web site contained a Trojan horse program that recorded keystrokes and stopped anti-virus software on systems on which it was installed. This program was not automatically injected into Web site visitors' computing systems; user interaction was instead required. Samsung removed this program from its Web site as soon as it was informed of the problem.

Another zero-day vulnerability that could be exploited to allow remote code execution in Microsoft Word 2000 has been found. This vulnerability can be exploited when users running vulnerable versions of Word 2000 open Word documents containing exploit code. Several instances of such code are already in the wild. Trojan.Mdropper.Q, for example, installs two pieces of malicious code on computing systems that it infects, both of which are related to Backdoor.Femo, a backdoor Trojan program capable of process injection in victim systems.

A new AOL Instant Messenger (AIM) worm, Win32.Pipeline, is spreading and appears to be trying to build a botnet. It infects computing systems whenever users are fooled into downloading an executable file that appears to be a JPEG image. Once Win32.Pipeline is installed and running, it establishes connections with numerous remote computers to download malicious code onto the infected system.

As far as malware goes, once again very little has changed since the last issue of Computers & Security. Malware writers continue to write programs that are difficult to detect so that these programs can accomplish their authors' goals before being discovered and eradicated. Malware writers are going for money, not fame, a trend that is likely to last through the foreseeable future. As such, many deadly Trojan horse and other malicious programs undoubtedly once again remain undetected.
2. Update in the war against cybercrime
A lawsuit filed in the state of Washington against Movieland.com parent company Digital Enterprises alleges breaches of the state's Computer Spyware and Consumer Protection Acts. People are enticed to order a free, three-day trial of the company's software that enables them to download movie clips. After the end of the trial period, pop-up messages that appear every hour, or sometimes even more often, demand payment. The pop-ups remain on the screen for 40 seconds and cannot be closed during that time. They are created by clandestine software installed on users' computers without their consent.

Romanian law enforcement has arrested 23 people who are accused of being part of an Internet fraud ring. These individuals allegedly created fraudulent Web sites that captured users' email addresses and then asked them to update their personal information. The information was allegedly subsequently used in offering fictitious items over the Internet. The accused allegedly defrauded individuals out of more than USD 120,000. Anyone who is convicted could receive a sentence of up to 15 years of imprisonment.

A US federal grand jury has indicted Jovany Desir, a Floridian, on five counts of wire fraud. He allegedly set up Web sites appearing to be American Red Cross, PayPal, eBay, and several banks' sites, designed to trick individuals who donated money for Hurricane Katrina victim relief into revealing their financial information. If convicted, Desir could get a prison sentence of up to 50 years and a USD 1 million fine.

Internet-related crime in Japan has, according to statistics from Japan's National Police Agency, recently grown substantially. Figures indicate that 1802 reported cases of Internet crime occurred during the first half of 2006, a 12% increase compared to the same period in 2005. Internet-related fraud constituted the largest proportion of reported Internet crime, with 40% of the reported cases. Unauthorized network access, including phishing ploys and illegal access to bank accounts, constituted 265 of the reported cases, a 34% jump compared to the 2005 statistics.

David Lennon pleaded guilty to violating Section 3 of the UK's Computer Misuse Act (CMA) for launching a denial-of-service (DoS) attack in which approximately five million messages were sent to his former employer's mail server
two-and-a-half years ago. The barrage of messages crashed the server. He was sentenced to two months of curfew, the terms of which require him to be home by 12:30 a.m. and remain there for a set period of time. In 2005 charges against Lennon were dismissed; a judge ruled that he had not violated the CMA, but the Crown Prosecution Service appealed the ruling. Lennon's case has led to efforts to update the CMA to cover a broader range of computer-related crimes.

The US Securities and Exchange Commission (SEC) is suing a Connecticut married couple, Jeffrey Stone and Janette Diller Stone, for using spam to pump up the price of stock they had bought. The couple allegedly dumped the stock once its value momentarily rose, and allegedly made USD 1 million off of the scheme.

The government of the People's Republic of China (PRC) has fined Hesheng Zhihui Enterprise Management Consulting for spamming. This company was fined 5000 yuan for sending massive amounts of email containing unsolicited advertisements to Internet users and was additionally told to stop sending spam immediately. A PRC anti-spam regulation went into effect earlier this year; it requires organizations sending commercial email to offer a way for recipients to agree to or decline receiving subsequent messages. The fine imposed on the consulting firm was the first ever in the PRC. Additionally, authorities in the PRC are targeting Web sites that breach the country's new copyright laws, which went into effect last July 1. More than 100 Web sites, including some that make movies and music available for free, have been closed.

AT&T is suing 25 data brokers on the grounds that they used pretexting, the practice of setting up bogus on-line accounts to gain access to information, to obtain approximately 2500 AT&T customers' call records. AT&T says that affected customers have been informed.

Christopher Maxwell of California has been sentenced to more than three years of imprisonment with three years of supervised release afterwards. He created a botnet that infected millions of computers worldwide in an attempt to reap profits from installing spyware on compromised machines. Earlier this year Maxwell pleaded guilty to one count of plotting to deliberately harm a protected computing system and one count of causing damage to a computer that impeded medical treatment. According to the FBI, Maxwell and two as yet unidentified accomplices made more than USD 100,000 from their illegal activities. The bots disrupted computers at numerous organizations, including the US Department of Defense (DoD), a California school district, and a Seattle hospital. Maxwell has also been ordered to pay more than USD 250,000 in restitution to the DoD and Seattle's Northwest Hospital and Medical Center.

Jason Arabo of Michigan has received a prison sentence of 30 months and must pay more than USD 500,000 in restitution for plotting to attack business competitors' Web sites. Arabo sold classic sports team jerseys on-line; by his own admission he hired Jasmine Singh to perpetrate denial-of-service (DoS) attacks against Web sites operated by others who offered similar merchandise. Singh was recently sentenced to serve five years of prison time and to pay USD 35,000 in restitution.

Danny Ferrer of Florida must serve six years in prison and pay restitution of more than USD 4.1 million for trafficking in
pirated software. Ferrer, who pleaded guilty to conspiracy and criminal copyright violation several months ago, operated a Web site on which copies of popular programs were available at low prices. Affected companies may have been cheated out of up to USD 20 million in lost sales. Ferrer has consented to appear in public service announcements regarding software infringement. He must also forfeit motor vehicles, boats, and airplanes bought with proceeds from his Web site and must complete 50 hours of community service.

Nicholas Lee Jacobsen of California has been sentenced to one year of home confinement and must also pay USD 10,000 for gaining unauthorized access to a T-Mobile computer and then accessing records containing the names and Social Security numbers (SSNs) of about 400 T-Mobile customers. The break-in occurred two years ago.

Jon Paul Oson of California has pleaded not guilty to charges that he harmed protected computing systems. Oson was formerly employed at San Diego's Council of Community Health Clinics, but reportedly resigned after getting a sub-par evaluation. Afterwards he allegedly gained unauthorized access to computers at two Southern California health clinics and deleted patient and billing information. A number of patients did not receive needed services because of the attacks. Oson is being held in lieu of USD 75,000 bail. If found guilty of the charges against him, he could get up to 20 years of jail time and fines of up to USD 500,000.

Sulagna Ray, an employee of Jaishree Infotech in eastern India, has been arrested on fraud charges. She allegedly used credit card information she collected while selling dish TVs to customers in the US to buy goods over the Internet valued at Rs. 1.8 lakh. The motive was reportedly to "have fun."

The Virginia Court of Appeals has upheld Jeremy Jaynes' conviction for sending spam to AOL customers. Two years ago Jaynes was convicted of breaking Virginia's anti-spam law; he was subsequently sentenced to nine years of prison time. Jaynes' lawyers argued that the Virginia court did not have jurisdiction in this case because the spam was sent from a computer in North Carolina rather than in Virginia, where AOL's servers reside. The defense lawyers also contended that Virginia's anti-spam law violates the right of free speech. Jaynes remained free on bond pending the appeal, but Virginia's Attorney General asked that the judge revoke the bond and force Jaynes to go to jail. Jaynes' lawyers have stated that they will file a further appeal.

Eric McCarty of California has pleaded guilty to the charge of accessing a protected computer without authorization. He admits he accessed the University of Southern California's (USC) admission application system last year and pilfered seven students' personal data after USC rejected his bid for admission. He will be sentenced in December. Having pled guilty to a felony count, he is likely to receive six months of home detention and then three years of supervised release. Additionally, he will probably be ordered to pay restitution of about USD 37,000. McCarty told the press that he found a security hole when he was applying to USC on-line.

Nathan Peterson of California has received a sentence of 87 months of imprisonment and a fine of USD 5.4 million for software piracy. The length of his prison sentence constitutes a new record for software piracy in the US. Late last year Peterson pleaded guilty to charges of criminal copyright
infringement. He sold pirated copies of software using a Web site and by sending email messages; he reportedly earned USD 5.4 million (the amount of his fine) by doing so. The FBI closed down Peterson's operation early last year.

A Queensland, Australia company has suffered a tarnished reputation due to a spam attack in which its name was spoofed by individuals who do not even live in Australia. The National Online Talent Management (NOTM) agency's customers, and others who were not familiar with this company, have sent large numbers of hostile email messages to the NOTM agency complaining about the spam. The bogus email contained large amounts of text from a bona fide NOTM message. NOTM is trying to figure out a way to restore its public image.

Two California companies and three individuals have consented to pay a USD 2 million fine to settle the Federal Trade Commission's (FTC's) charges of false and deceptive practices. Enternet Media, Conspy, Baback Hakimi, Lida Rohbani, and Nima Hakimi conducted a ploy in which they advertised virus and spam eradication but in reality infected the computing systems of users who responded to their offer with malicious code. Users saw a pop-up that advised them of problems with their browsers and offered to fix them for free. Turning down the offer resulted in the pop-up ad being displayed again. People who downloaded the so-called fix caused difficult-to-remove spyware and tracking software to be installed on their machines. The accused also deployed other tactics, such as offering free music files, mobile phone ring tones and wallpaper, to trick people into downloading the malware onto their computers. The conditions of the settlement permanently prohibit the accused from interfering with users' computers. As many as 18 million computers around the world may have been infected with the malicious code.

Fernando Ferrer, Jr. and Isis Machado, both of Florida, were indicted on charges of conspiracy to engage in computer fraud, conspiracy to perpetrate identity theft and conspiracy to illegally disclose individually identifiable health information. They also face fraud-related charges in connection with misusing computers and violating the Health Insurance Portability and Accountability Act (HIPAA). Ferrer and Machado allegedly conspired to pilfer personal medical data pertaining to more than 1100 Cleveland Clinic Florida patients and to exploit the data to make more than USD 2.8 million from bogus Medicare claims. The Cleveland Clinic has informed patients whose information was pilfered of the data security breach. If convicted of the charges against them, Machado and Ferrer could each get a maximum of 10 years of imprisonment and a fine of USD 250,000.

Farid Essebar and Achraf Bahloul, both from Morocco, have been sentenced to prison for their activities related to the Zotob worm. This worm was released in August last year; it infected numerous computers, including computers belonging to the Associated Press, ABC, CNN, the New York Times, and the US Immigration and Customs Enforcement Bureau. Essebar and Bahloul received a two-year and a one-year sentence, respectively. Defense attorneys for both men said that they plan to file an appeal.

Microsoft has won a civil suit against Paul Fox, a UK spammer. A court has ordered him to pay GBP 45,000 to Microsoft for violating the terms and conditions of using Microsoft's
Hotmail service; among other things, these terms and conditions forbid sending spam to Hotmail users. Microsoft filed the civil suit because UK law enforcement did not pursue prosecution, on the basis that UK anti-spam legislation is narrow in scope.

KSTM LLC, a company specializing in sending bulk email, has been ordered to pay Earthlink USD 11 million for spamming Earthlink customers in violation of the US CAN-SPAM Act. The judgment, rendered by a federal court in Atlanta, also forbids KSTM LLC from inserting bogus information in the "from" fields of email messages, obfuscating the identity of the sender, selling email addresses, and interacting with or obtaining Earthlink accounts.

A US federal judge has ordered UK-based Spamhaus, a "spam busting" organization, to pay USD 11.7 million in damages to e360 Insight LLC. Spamhaus had identified e360 Insight LLC as a spam source and consequently decided to block all emails sent by this company. The judge ordered Spamhaus to quit blocking email from the company and to post an apology on its Web site. Spamhaus plans to defy the judge's order, saying that e360 Insight LLC is a bona fide spammer under UK spam laws and that US legal jurisdiction does not apply to UK organizations.

Xinet, the People's Republic of China's (PRC's) second largest domain name service (DNS) provider, experienced a DoS attack that took down 180,000 Web sites over eight hours. Among the many sites taken down was the Shanghai Daily site, which contained information about the attack.

Traci Southerland of Ohio has received a prison sentence of 13 years for stealing personal data from the Hamilton County, Ohio Clerk of Courts' Web site and then using the data to perpetrate credit card and check fraud amounting to USD 500,000. The Web site has been changed so that it blocks access to documents that contain personally identifiable information.

Six individuals have been arrested on fraud charges for their alleged participation in a phishing ploy in which credit card and bank account numbers were stolen from AOL users. The individuals allegedly gleaned thousands of AOL account addresses and then mailed e-cards that downloaded programs that kept users from logging on to AOL without first entering credit card and/or bank account information. The perpetrators allegedly bought computers, gift cards, and gaming systems using the pilfered financial information. Three individuals have already pleaded guilty and face sentencing; they are likely to have to serve somewhere between two and nine-and-a-half years of prison time. Three others have still not been arraigned.

Microsoft is suing an as yet unidentified programmer who created and released a program that defeats digital rights management (DRM) copy protection in Windows media files. Microsoft has released several patches to counter the program, but the individual has in each case released an updated version of the program that circumvents the patches. Microsoft wants unspecified damages and a permanent injunction against developing software that defeats DRM copy protection. The suit claims that the programmer obtained Microsoft source code without authorization, an allegation denied by the accused.

Yvon Hennings, a contractor at the Stevens Hospital emergency room in Edmonds, Washington, pilfered patients' credit
card numbers and then handed them over to her brother, who used them to buy a large amount of goods on the Internet. Hennings has pleaded guilty to conspiracy to perpetrate access device and wire fraud; she will be sentenced soon. Her brother will go on trial early next year.

The number of computer crime-related news items in each Security Views column is steadily growing issue by issue. My count indicates that there are 28 such items in the current issue, a new record. The trend towards a growing number of arrests, trials and convictions for computer crime-related activities continues, too. Law enforcement and the legal system have targeted both individuals and crime rings, and where law enforcement is unable or unwilling to arrest criminals, organizations such as Microsoft and Earthlink are increasingly filling the gap by filing lawsuits. Not too many years ago computer criminals could operate with much greater boldness; there was little to fear when they engaged in their sordid deeds. They must now be more clever and clandestine to avoid being prosecuted or sued, and the probability that they will have to face the consequences of their actions seems to be continually increasing.
3. More compromises of personal and financial information occur
Compromises of personal and financial information continue to occur at alarming rates. Theft and loss of computers continue to be among the major causes of such compromises, as shown by the following news items:

Another US Department of Transportation (DOT) laptop system was stolen during an agency-sponsored conference last spring. The computer, which was assigned to the special agent in charge of the Miami, Florida office, held cleartext case file information. DOT's Inspector General has not resolved the issue of whether or not the laptop stored personally identifiable information.

Another Department of Veterans Affairs (VA) laptop computer, one that contains personal information pertaining to nearly 20,000 US veterans, was lost but later found. A reward of up to USD 50,000 had been offered for information leading to its recovery. Unisys, a company contracted to monitor insurance claim processing data for the VA, had the computer. Khalil Abdullah-Raheem, a temporary Unisys employee, was arrested in connection with the theft. He was released after he posted a USD 50,000 bond. The FBI is determining whether or not the data stored on the laptop were compromised.

Williams Sonoma has notified approximately 1200 current and prior employees that their personal information was stored on a computer that was taken from the apartment of a Deloitte & Touche employee. None of this information was encrypted. Deloitte & Touche was performing an annual audit of Williams Sonoma's financial statements. Local law enforcement is investigating the incident.

Chevron Corp. conceded that a laptop stolen from an independent contractor holds personally identifiable information pertaining to an undetermined number of current
and prior Chevron employees. Chevron has started to inform individuals whose data were on the stolen laptop and has offered them free credit monitoring and identity restoration services. Chevron also sent an email to all its employees to inform them of the data security breach and to boost awareness of the need for data security.

Ten laptop computers that hold personally identifiable information of Medicare and Medicaid patients who have received treatment at Hospital Corporation of America (HCA)-managed hospitals in eight states were stolen from HCA offices. HCA is conducting an internal review, and the FBI has been called in to investigate.

A laptop computer that contained personally identifiable information, including names, SSNs and medical insurance information pertaining to more than 28,000 Home Care patients of Beaumont Hospitals, was stolen from the car of a nurse in Detroit. So far no evidence of misuse of the information exists. The laptop's stored information was both encrypted and password-protected, but the nurse's access code and password also fell into the wrong hands when the computer was taken. The login account for the stolen computer was deleted. The laptop was found and returned three weeks after the theft.

A laptop system with personally identifiable information pertaining to over 600 American Family Life Assurance Co. (Aflac) policyholders was pilfered from an Aflac agent's car. Aflac informed people potentially affected by the data security breach in a letter. The stolen laptop had tracking technology installed. Aflac has set up a call line for potentially affected customers. Local law enforcement has been called in.

The Federal Motor Carrier Safety Administration, a division of the DOT, has announced that a laptop stolen from a government car may contain personally identifiable data pertaining to nearly 200 people who hold commercial driver's licenses. The incident occurred last August in Baltimore, Maryland. This data security breach potentially affects 40 motor carrier companies. Law enforcement has been notified.

Two laptop systems stolen from the Washington, DC offices of professional services contractor DTI last August contained the SSNs of 43 Department of Education employees who were evaluating grant applications for the teacher incentive fund. The data were not encrypted. Almost all of the potentially affected people have been notified, as were law enforcement and the Department of Education. Security cameras recorded the actions of a suspect in the theft. A reward for the return of the stolen computers has been offered.

A laptop computer pilfered from the house of a contractor for the city of Chicago stored personally identifiable information, including names and SSNs, of thousands of city employees. Nationwide Retirement Solutions (NRS) is informing people whose information was on the stolen computer and will offer them one year of free credit monitoring and USD 25,000 of free identity theft insurance. NRS has used encryption on all laptop systems since the incident occurred.

A computer taken from a medical lab's sample collection center in New Jersey contains personally identifiable
information of patients. LabCorp mailed letters to inform people whose information was on the computer, which was stolen early last spring. The information includes names and SSNs, but not the results of lab tests.

Wells Fargo has informed certain of its employees that their personal information, including names, SSNs, and medical insurance and prescription drug information, may have fallen into the wrong hands because a laptop system and a hard drive were stolen from an audit company employee's car. The audit company was appraising Wells Fargo's health plan information in accordance with Internal Revenue Service (IRS) requirements. Wells Fargo subsequently terminated the services of the audit company.

Two laptop systems that contained data pertaining to over 13,000 prior and current students were pilfered from an office at the University of Minnesota. The data include names, dates of birth, aptitude test scores, academic probation information, the SSNs of some of the students, and more. Affected individuals were notified of the data security breach. A university spokesperson said that storing such data on a laptop's hard drive, as was the case here, is not standard operating procedure.

A laptop system stolen from an Ottawa branch of the Bank of Montreal contains personally identifiable information pertaining to approximately 900 bank customers. A bank spokesperson announced that no evidence exists that the information has been misused. BMO Bank of Montreal has recommended that potentially affected customers closely monitor their accounts for suspicious activity.

A laptop system stolen from the car of a Florida National Guard soldier stored personally identifiable information pertaining to up to 100 Florida National Guard soldiers. The incident led to a Florida National Guard security review.

A laptop system pilfered from an employee of auditor Morris, Davis & Chan held cleartext, personally identifiable pension plan information that included names and SSNs of employees of Howard, Rice, Nemerovski, Canady, Falk & Rabkin, a San Francisco law firm. The incident potentially affected approximately 500 people, all of whom were notified about it.

Two computing systems stolen from the Radiation Therapy Department at DePaul Medical Center in Norfolk, Virginia stored information pertaining to approximately 100 patients. The hospital is informing everyone who has been potentially affected.

The US Department of Commerce (DoC) has conceded that 1137 of its laptop computers have been lost or pilfered since 2001 and that 249 of them held personally identifiable information. Some of the computers were password-protected; some others had data encryption. DoC Secretary Carlos Gutierrez said that approximately 6200 households could be affected by the thefts.

A laptop taken from the hotel room of a General Electric (GE) employee contains the names and SSNs of approximately 50,000 current and prior GE employees. A GE spokesperson said that the company is offering all potentially affected people one year of free credit monitoring.

A Nagasaki University official has stated that six laptop computers holding personal information pertaining to
approximately 9000 patients were stolen from the Nagasaki University Hospital of Medicine and Dentistry. Names, birth dates and medical diagnoses were among the types of information that may have fallen into the wrong hands. Eight USB memory sticks and two hard drives were also taken. Law enforcement has been notified of the thefts.

Computers taken from the Kenya Revenue Authority (KRA) held income tax return information. Other Kenyan offices have also recently experienced similar thefts. The thieves have reportedly been targeting recent models of computers and are not interested in any data on them.

The Leeds School of Business at the University of Colorado is notifying over 1300 current and former students that their names, SSNs and grades are stored on two computing systems that have disappeared. One of the computers has since turned up. The university has set up a hotline to answer questions from individuals who are potentially affected.

A laptop system stolen from the car of a financial services company employee in Edmonton, Alberta contains personal information of 8000 area medical doctors. The company, MD Management Ltd., has informed the doctors of what happened. Alberta's Office of the Information and Privacy Commissioner said that MD Management Ltd. did not adequately safeguard the information from theft.

A computer taken from a North Carolina Department of Motor Vehicles (DMV) office holds personal information pertaining to approximately 16,000 driver's license holders in the state. The information includes names, addresses and SSNs of individuals who were recently issued driver's licenses. No indication exists that the information has been misused; the DMV has informed everyone who has been potentially affected.
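A common thread running through the laptop-theft accounts above is cleartext data on stolen hardware. As a minimal illustrative sketch of the kind of at-rest protection these incidents argue for (not any particular organization's tooling; it assumes the third-party Python "cryptography" package, and the file name is hypothetical), file-level encryption can be as simple as the following:

    # Minimal sketch of file-at-rest encryption; illustrative only.
    # Assumes the third-party package: pip install cryptography
    from cryptography.fernet import Fernet

    def encrypt_file(path: str, key: bytes) -> None:
        # Read the cleartext, encrypt it with an authenticated symmetric
        # scheme (Fernet: AES-128-CBC plus an HMAC), and write a new file.
        with open(path, "rb") as f:
            plaintext = f.read()
        with open(path + ".enc", "wb") as f:
            f.write(Fernet(key).encrypt(plaintext))

    def decrypt_file(enc_path: str, key: bytes) -> bytes:
        # Decryption fails loudly (InvalidToken) if the file was tampered with.
        with open(enc_path, "rb") as f:
            return Fernet(key).decrypt(f.read())

    if __name__ == "__main__":
        key = Fernet.generate_key()  # in practice, keep the key off the laptop
        encrypt_file("patient_records.csv", key)  # hypothetical file name

As the Beaumont Hospitals item above illustrates, however, encryption of this kind helps only if the key and login credentials are not stolen along with the machine.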
Other compromises were the result of unauthorized remote access to systems, as described below.

A security incident involving one of the University of South Carolina's servers may have exposed personal information pertaining to 6000 current and prior students. All potentially affected individuals have been informed of the incident, which took nearly one year to identify. There is no evidence of identity theft attempts resulting from the incident. Law enforcement has not been contacted concerning the incident.

AT&T has admitted that personal and financial information pertaining to approximately 19,000 customers who used the company's on-line shopping site to subscribe to DSL services has been compromised. The perpetrators used the information they stole to initiate phishing attacks in which email messages directed recipients to visit a certain Web site to update their credit card information. AT&T informed credit card companies of the security breach immediately after learning of it and has also informed potentially affected customers. AT&T is also cooperating with law enforcement.

Players of Second Life, members of a virtual community, have been requested to change their passwords after a perpetrator gained unauthorized access to a database that stored personal information pertaining to all 650,000 members of this community. The compromised information
includes names, addresses, passwords and encrypted credit card numbers. Members were notified of the breach shortly after it occurred.

Purdue University is informing about 2500 people who were students at the school in 2000 that their personal information may have fallen into the wrong hands. The information includes names and SSNs. A security check of an administrative computer in the University's Chemistry Department revealed that a perpetrator may have illegally accessed a file containing this information. Purdue has set up a toll-free number for individuals to call if they think they may be affected by the incident.

A technical team at the University of Texas at San Antonio (UTSA) and local and state officials are looking into an incident in which a server at the university was compromised. Information that includes names, addresses and SSNs pertaining to 53,000 current and prior students who have received financial aid and 11,000 current and prior faculty and staff members is stored on this server. All potentially affected individuals have been notified of the security breach. A university spokesperson said quick action prevented information from actually being stolen.

Intrusions into a server at the University of Alaska, Fairbanks (UAF) Bethel campus that may have started as early as February 2005 are still being investigated. Personal information pertaining to nearly 39,000 current and prior students was compromised. FTP servers on the compromised machine were configured to allow back door access to the perpetrator(s). The FBI is involved in the investigation.

Additional personal data exposure incidents were due to poor protection of personal information on Web sites, as described below:

Personal data pertaining to about 21,000 students who borrowed money for college from the US government have been compromised on the US Department of Education's Web site. To prevent further compromises of such information, parts of the software that led to the compromise have been disabled. The US Department of Education is offering free credit monitoring to students who have been potentially affected.

Portland, Maine law enforcement is investigating a compromise of the PortTix Web site in which credit card information belonging to approximately 2000 Merrill Auditorium patrons was stolen. Potentially affected individuals were informed of the incident. A phone call from an anonymous individual tipped the site's operators off that the compromise had occurred.

The names, addresses and credit card numbers of 3235 Nikon World magazine subscribers were publicly available on the Internet for about nine hours last week. Someone who attempted to subscribe to the magazine on-line discovered that subscriber information was accessible by clicking on a link in an email message that Nikon World sent. The data exposure affects only individuals who subscribed to the magazine after January 1, 2006; they were all notified of the data security breach.

A database containing the names, addresses and credit card numbers of more than 9000 Life is Good customers has been
accessed without authorization. A company spokesperson said potentially affected customers were informed of the incident soon after it occurred. Access to the Web site was promptly shut down and additional security measures were put in place.

The names, addresses, SSNs and other personal information pertaining to approximately 25,000 gun permit holders in Berks County, Pennsylvania were exposed on the Internet for a while. The Berks County sheriff was trying to secure this information to comply with a court order when the exposure occurred. Pennsylvania state law requires notification of all persons whose personal information has been exposed.

Several data compromises were due to missing or stolen media:

A USB storage device missing from a locked office holds the names and SSNs of approximately 4150 current and prior employees of Erlanger Hospital, Tennessee who had recently undergone employment status changes. Potentially affected individuals were promptly informed of the incident. An additional 2050 current employees whose information was not on the missing device were nevertheless also informed of the incident.

Backup tapes containing account information of more than 2.6 million Circuit City credit card customers were unintentionally discarded. Chase, the card issuer, thinks that the tapes were compacted, destroyed, and buried, but is nevertheless monitoring the accounts of all potentially affected customers.

Thirty-one tapes containing information about hundreds of thousands of British Columbia citizens have disappeared from a government office in Victoria, Canada. Government authorities have not notified individuals who are potentially affected; Canadian law does not require such notification.

Finally, one data loss incident was due to a combination of factors:

Berry College, Georgia officials have been told that sensitive student information (names, SSNs, and reported family income) has been lost. Part of this information was digital; part was printed on paper. A consultant apparently misplaced the information at an airport. Over 2000 students and applicants who submitted an application for Federal Student Aid at the college over the last two years have been potentially affected. They have been notified of the incident, and a hotline and a Web page have been set up to answer questions.

The number of data security breaches is proliferating in a manner that closely parallels the proliferation of computer-related crime, and there appears to be no end in sight. As I have said so many times before, as long as individuals and organizations continue not to be held accountable for safeguarding personal and financial information, they will continue to be negligent, resulting in one data security compromise after another. National legislation requiring suitable levels of protection for personal and financial data and prescribing punishments for those who fail to adequately protect such
data remains the most realistic course of action. National legislation that requires prompt notification of data security breaches would also go a long way in curbing the fallout from data security breaches.
4. Aftermaths of data security breaches continue

Aftermaths of data security breaches that various organizations have experienced continue to unfold. The director of network communications services and the manager of Internet systems for Ohio University were fired in the aftermath of the massive security breach there earlier this year; the CIO also resigned. Three AOL employees connected with the recent exposure of user search query data on the Internet were also fired, and the CTO resigned. But AOL's problems have apparently just begun; three people whose search records were exposed have filed what they hope will become a class action lawsuit against AOL. The suit claims AOL's actions violated both the Federal Electronic Communications Privacy Act and California consumer protection laws. Finally, in Australia 19 employees of Centrelink, an agency of Australia's Department of Human Services, were fired, 92 resigned, and more than 300 may get cuts in pay after accusations of massive privacy breaches there were made. As many as 600 employees may have accessed welfare and other records without authorization.

There is a high price to be paid for neglecting security; the above accounts provide strong support for this view. Individuals have been fired and forced to resign in the aftermaths of data security breaches, something that has become a precedent in organizations that experience such breaches. Such aftermath stories will hopefully motivate senior management and others to take information security more seriously.
5. Update on computer crime-related legislation

Legislation that threatens people who violate copyrights with up to five years of prison time recently went into effect in Russia. The legislation, which was passed over two years ago, is an amendment to Russia's existing copyright protection law. Its purpose is to stop illegal distribution of text, music and video in MP3 format over the Internet. The amendment gives operators of Web sites that make copyright-protected content publicly available two years to register and to obtain licenses. Russia's primary motivation is to gain entry into the World Trade Organization (WTO); several countries have refused to admit Russia into the WTO because Russia has been lax in dealing with copyright violations.

New draft legislation in Germany would make penetrating a computer security system and gaining access to secure information illegal and subject to punishment, even if no information is stolen. Additionally, the draft law states that groups that intentionally create, spread or buy hacker tools designed for illegal purposes could also face punishment. Other punishable computer-related crimes include DoS and other computer sabotage attacks on individuals, not just
businesses and public authorities (as the current law on computer crime stipulates). Offenders could get a maximum of 10 years of jail time for major offenses. The draft legislation is intended to close loopholes in current computer crime legislation.

The US House of Representatives passed a bill that amends and strengthens the Federal Information Security Management Act (FISMA) by giving CIOs enforcement authority over IT systems containing personal information. The bill further protects veterans' personal data and also requires all federal agencies to notify the public when data security breaches involving sensitive information occur. H.R. 5835, the Veterans Identity and Credit Security Act of 2006, will next go to the Senate for approval. H.R. 6163, the Federal Agency Data Breach Protection Act, requires the Office of Management and Budget (OMB) to set up procedures for agencies to follow if personal data are lost or stolen. It also requires that individuals be notified if their personal information could be compromised by a data security breach at a US federal agency. Federal agencies would be required to ensure that costly equipment containing sensitive information is accounted for and secure. Similar but less detailed language from an earlier bill, H.R. 5838, the Federal Agency Data Breach Notification Act, was incorporated into the VA bill during the summer. As amended, H.R. 5835 requires that Congress and the people affected be notified if a data security breach transpires. The bill also requires the VA to conduct data breach analysis and, if needed, provide credit protection services and fraud resolution services upon the request of those potentially affected. This identity theft remediation protection may include a credit freeze, identity theft insurance, and credit monitoring. The VA has installed encryption software on 15,000 laptops and is implementing the Data Security Assessment and Strengthening Controls Program to remedy many of its security weaknesses.

The legislation that recently went into effect in Russia and the new proposed legislation in Germany look valuable in the fight against computer crime. Weak links in the international effort to combat software and other types of piracy can and do negate efforts in other countries. Russia's adoption of stronger legislation that prescribes strong punishment for piracy should help eliminate one of these weak links, provided of course that Russia makes a strong effort to enforce the provisions of this legislation. Germany's attempt to close loopholes in current computer crime legislation will enable German authorities to arrest and try individuals who engage in certain kinds of activity, such as launching DoS attacks. My only concern with the proposed German legislation is with the part that prescribes punishment for activities in connection with "hacker tools." These tools are routinely used by the "white hat" community; I would hate to see a member of this community go to prison for conducting a sanctioned penetration test. Finally, the new bills in the US concerning data security breaches seem on the surface to be steps in the right direction. Once again, my only concern is that federal legislation concerning such breaches is likely to have weaker provisions than many existing state laws, thereby undermining the latter.
6. HP's woes continue
Hewlett-Packard's (HP) effort to identify board members suspected of leaking information to the press has done considerable harm to HP's public image and may ultimately result in one or more criminal convictions. The AT&T home phone records of at least one board member at the time, Tom Perkins, were obtained through pretexting, a form of social engineering in which a perpetrator poses as someone else, ostensibly inquiring about his or her own personal or financial information. Additionally, monitoring programs were covertly installed on suspects' computing systems. When news of HP's extreme tactics became public, Perkins resigned in protest; he had had nothing to do with the press leaks. Chairman of the Board Patricia Dunn resigned shortly afterwards. Board member George Keyworth, who has admitted that he was the source of the leaks, resigned later. Both Dunn and CEO Mark Hurd had to testify at an investigative hearing held by the US House Energy and Commerce Committee concerning the legality of telephone pretexting and were also required to submit documents to this committee. To make matters worse, Patricia Dunn and four others now face felony charges for their alleged participation in this scandal. One of the other four accused persons is a former HP senior attorney; the other three were contractors who allegedly obtained others' phone records without authorization. The California attorney general has charged all five individuals with fraudulent wire communications, wrongful use of computer information, identity theft and conspiracy to perpetrate the aforementioned crimes.

I almost decided not to include this news item in this issue's Security Views column, but changed my mind, even though what Ms. Dunn and others have allegedly done is ostensibly not a violation of computer crime laws. This ugly story nevertheless presents a poignant lesson to infosec professionals in that it shows what happens when corporate governance breaks down. The practice of infosec and the practice of IT governance go hand in hand. It is difficult to envision how things could have gotten so far out of control in HP executive management circles, yet they did, and HP and the individuals involved will likely suffer the consequences of the fallout for some time to come.
7. Bank fined for buying personal information
Fidelity Federal Bank & Trust has been ordered to pay USD 50 million for buying 656,600 names and addresses from the Florida Department of Highway Safety and Motor Vehicles for use in direct-mail marketing of car loan offers. The Electronic Privacy Information Center (EPIC), which joined in filing the suit for the plaintiffs, stated that Fidelity Federal Bank & Trust's purchases violated the Driver's Privacy Protection Act, which went into effect in 1994 to protect driver's license holders from having their personal information given to others. According to papers filed in Kehoe versus Fidelity Federal Bank and Trust, from 2000 to 2003 Fidelity purchased the data for only USD 5656. The case has gone through numerous proceedings; in 2004 the US District Court for the Southern District of Florida ruled that James Kehoe had to
show actual damages before being awarded compensation under the Driver's Privacy Protection Act. Kehoe appealed to the 11th Circuit Court of Appeals, which overturned the ruling. Afterwards EPIC joined the suit, stating that the case dealt with the threat that the sale of information pertaining to individuals presents to privacy.

A fine of USD 50 million should catch the attention of individuals and organizations that are tempted to purchase personal and financial information in the manner that Fidelity Federal Bank & Trust did. National legislation that would greatly restrict access to such information would be most preferable. Given that such legislation is not even close to passing, however, the lessons learned from cases in which those who gained access to such information without legitimate reasons were penalized serve somewhat as a substitute.
8. Bank of Ireland's compensation of phishing victims stirs controversy

Customers of the Bank of Ireland (BoI) have lost at least EUR 110,000 in a phishing ploy in which messages appearing to come from the BoI informed BoI customers that they needed to update their banking details by visiting a Web site that appeared to belong to the BoI. Many customers succumbed to the ploy before the BoI could issue a warning about the scam. After first refusing to compensate victims, the BoI has agreed to compensate them. Nine customers who were victimized had threatened to sue the bank for compensation if their money was not returned to them. Critics are disturbed, however, that the BoI compensated the victimized customers, because the compensation the bank paid might lead to a surge in phishing cases. There are also worries that although the move was a goodwill gesture by the bank, it might lull people into feeling less threatened by scams in the future. Additionally, banks are likely to implement more technologies that push the responsibility for safeguarding details back onto customers and require customers to prove that they did not leak their own personal and financial data.

The BoI was in a real bind. At first the bank refused to compensate the phishing victims, but the threat of lawsuits by its customers, coupled with the possibility of being viewed negatively because of adverse press exposure, undoubtedly caused a change of mind. Whether the BoI's having compensated the phishing victims will increase the number of phishing attacks against bank customers and make customers more complacent about phishing and other attacks remains to be seen. One thing is sure: the BoI has in effect set a precedent. One would thus at a minimum think that this bank would now adopt measures to reduce the likelihood of customers succumbing to phishing attacks in the future: perhaps customer education, more monitoring of mail traffic to see if BoI customers are being targeted, and so on.
9. European Commission proposes changes in data security breach notification

A proposal from the European Commission advocates changing European Union (EU) telecommunications regulations to
include requiring service providers to inform customers and regulators of personal data security breaches. The proposed requirements are similar to the data security breach notification laws of California and 33 other US states. Current EU regulations mandate only that network providers notify customers of possible security risks, but do not address security breaches. UK law already closely follows the current EU Directive through the Privacy and Electronic Communications Regulations of 2003. If approved, the proposed change to EU telecommunications regulations would constitute a huge step forward in helping protect potential victims after data security breaches have occurred. Additionally, given the worldwide influence of the EU, having the proposed change approved would put pressure on governments in countries around the world to adopt similar measures.
10. EU and US differ regarding security and privacy concerns for information

Belgium-based Swift, which handles international finance transactions, is caught between US and European lawmakers trying to determine jurisdiction over scrutiny of not only its records, but also records of other multinational companies doing business in Europe. The case represents a quandary for both international business and international law. Because commercial matters would be under EU authority, Belgian law might from one perspective be deemed the authority for the case. The US Treasury's terrorism finance investigation has resulted in Swift being served with subpoenas for details of international financial transactions it conducted for private clients. Any firm that has European information in US jurisdiction, as well as information possessed by US organizations (e.g., Google and Hotmail) regarding European interests, may also be a target of investigation. Although the EU had agreed to let airlines comply with US demands for trans-Atlantic passenger flight details for security scrutiny, this ruling was later overturned by the European Court of Justice. Security matters thus fall outside the jurisdiction of EU regulators, who may be unable to impede US subpoenas served on European organizations when national security is involved. The Article 29 Data Protection Working Party is lobbying to move the Swift case to the jurisdiction of EU data protection statutes to keep the US from accessing private information possessed by European multinationals. The Working Party is concerned that without a unified EU law, individual countries might be tempted to enter bilateral accords permitting US investigators to obtain private corporate information. Most EU member countries have expanded their interpretations of the EU Data Protection directive to also cover security matters. This news item once again shows just how far apart the EU and the US are when it comes to dealing with and safeguarding personal information. Conflicts described in this item are bound to be only the tip of the iceberg. The EU places a high value on privacy protection, whereas the US government is obsessed with fighting terrorism. These two entities have little in common when it comes to attitudes about privacy, and, unfortunately, I do not think that either entity is inclined to compromise.
11. Credit card companies form Security Standards Council

Competing credit card companies American Express, Discover Financial Services, JCB, MasterCard Worldwide, and Visa International recently announced the formation of the Payment Card Industry (PCI) Security Standards Council to enhance and maintain a uniform security standard, the PCI Data Security Standard, for credit and debit card payments. The PCI Data Security Standard (DSS) went into effect in January 2005 to make payment card transactions more efficient for merchants, payment processors, point-of-sale vendors, financial institutions, and more than a billion card holders around the world. The DSS contains rules that include instructions concerning appropriate information encryption methods, common technical standards, and security audit procedures. In an update of the DSS, the new council gives instructions concerning how to implement the new standards and clarifies previously vague language. The creation of the PCI DSS has already had a huge positive impact on the practice of security within organizations that issue or use credit cards in financial transactions. Having more specific rules, standards, and instructions will not only improve organizations' ability to comply, but will also make compliance auditing easier. From a security perspective, the PCI Security Standards Council is thus in essence taking a good thing and making it better. Reactions from merchants and others are not likely to be as positive, given that to them the PCI DSS entails cost and inconvenience more than anything else. At the same time, however, I predict that resistance will decrease over time, once merchants and others get used to the rules and standards and incorporate them into their own standards and procedures.
12. eVoting machines come under more fire
Criticisms of eVoting machines are growing rapidly. More than 80% of US voters will use electronic voting machines in the national election on November 7; a third of all voting precincts are using them for the first time ever. In Pennsylvania voter advocates have filed a lawsuit to prevent counties from using eVoting machines, complaining that these systems leave no paper audit trail that could be used for a recount, audit, or other need. The suit asked the State Commonwealth Court to decertify machines used in 58 of 67 counties; the nine other counties use optical scanning systems, the type of system that the plaintiffs maintain should be used throughout the entire state. The plaintiffs allege that votes have been lost several times due to computer malfunctions, including in Allegheny and Centre counties during the May primary and in Berks County in May of last year. A similar suit in New Mexico forced that state to use optical scan ballots earlier this year. Other suits involving paper-based voting systems have been filed in Arizona, Colorado, and California. In Ohio, a study report by the Board of Commissioners of Cuyahoga County of 467 of 5407 Diebold Election Systems' eVoting machines showed that even backup paper records meant to assure voters that their votes were tabulated correctly can
be ruined. The report stated that almost 10% of the paper copies of votes cast in the election were destroyed, blank, unreadable, missing, taped together, or otherwise compromised. Ohio law requires that each machine include a voter-verified paper audit trail listing each vote that has been cast and considers the paper audit trail to be the official ballot. One-third of the booth workers had difficulty setting up the machines. Forty-five percent had problems closing out machines. Thirty-eight percent had problems with printers or spools. Ninety percent of the voters liked the new systems. Ten percent of the voters reported problems with the machines. Twenty-four of the storage devices held no election data whatsoever. The report also said that 72% of the polling places had a discrepancy between the electronic record on memory cards and the paper ballots; 42% of the discrepancies involved problems with 25 votes or more. In Maryland, Governor Robert L. Ehrlich Jr., who agreed to buy electronic touch-screen eVoting machines three years ago, now worries that their vulnerability to attackers and lack of a paper trail might taint the results of otherwise fair and accurate elections. Ehrlich called for a review by information security consultants; the results indicated that 328 security weaknesses existed, of which 26 were critical. When promised that the flaws would be remedied, Ehrlich agreed to honor the contract. However, according to state officials, Diebold and the state election board have not kept their promises. A test of Diebold machines in Utah demonstrated how easily the machines could be tampered with, prompting California and Pennsylvania to warn their county election administrators to adopt additional security measures. According to a report by the Brennan Center for Justice at the New York University School of Law, the greatest vulnerability of electronic voting machines is that a software virus could be inserted to alter voter tallies. The momentum for the use of eVoting machines has slowed considerably, to the point that if eVoting machine vendors do not do something significant fairly quickly, the future of eVoting may be in jeopardy. States in the US as well as the US government itself rushed into eVoting, thinking that it would provide a safe and more efficient way of tallying election votes. In hindsight, the states and the US government acted irresponsibly: they rushed into a technology without carefully analyzing and dealing with the many risks, much as IT organizations have too often rushed into new technologies.
13. USB drive-related risks cause increasing concern

Universal Serial Bus (USB) drives ("flash drives") are becoming major security concerns because an increasing number of them are being lost or stolen. More than 50% of 484 technology professionals surveyed said USB drives contain confidential information that is unprotected; 20% said that at least one USB drive with data is lost at work each month. Although recent thefts of laptops have received a considerable amount of publicity and have become a major focus of security professionals, these storage devices are so small and inexpensive that many employees do not pay much attention to their disappearance. Sixty-six percent of 248 technical security
professionals surveyed who use removable media devices said that their organizations had no security policy, despite the fact that they were aware of the risks. Recent events highlight the risks of USB drives. Wilcox Memorial Hospital in Lihue, Hawaii, reported to 120,000 current and former patients that a flash drive containing their personal information, including names, addresses, SSNs and medical record numbers, had disappeared. The hospital has since forbidden the use of USB drives. Earlier this year, flash drives containing sensitive and classified US military information were being sold at a bazaar outside Bagram, Afghanistan. The US Army does not know how they were lost and has since set higher security standards for USB drives. The personal information of 6500 current and former University of Kentucky students, including names, grades, and SSNs, was compromised earlier this year after a professor's flash drive was pilfered. The lost drive has not been recovered. Some companies are gluing shut the USB ports on their desktop systems so that information cannot be copied from them. Several years ago Dr. Jon David published a paper in Computers and Security on security-related risks due to the use of USB drives. His paper, well ahead of its time, proved correct: USB drives and the disappearance of sensitive information are strongly linked. Numerous solutions, including some rather ingenious vendor solutions, are available, but for one reason or another more organizations are not using them. USB ports are, after all, not entirely bad: they can be and frequently are used for interfacing with scanning devices and printers, third-party authentication, and more. My prediction is that organizations will increasingly focus on and try to control security-related risks due to the use of USB drives as losses due to their use mount.
14. Business Software Alliance says unlicensed software should result in fines

The Business Software Alliance (BSA), a software publishing industry group, is lobbying for penalties for companies that have unlicensed software. Currently any company that uses software without a license and is caught must pay the license fee it should have paid, but in civil courts such companies are exempt from penalties (although considerably tougher penalties exist for counterfeiters trading in unlicensed software). The BSA wants a fine to be added to the cost of the purchase of a license for infringing products. The BSA is not advocating the kind of punishments found elsewhere, such as damages that double the retail price of the software, but this organization points out that the lack of any penalty for infringement has helped to increase piracy. The BSA said that 80% of the infringement cases it found were due to negligence rather than to malice. Many firms simply fail to keep their software licenses up-to-date. The BSA recommended that the US government require companies to check whether their software licenses are up-to-date. The problem of unlicensed software will never go away; even being able to keep the problem within reasonable limits would be a major victory. The BSA, thus faced with an overwhelming task, is taking a most reasonable position, saying in effect that the punishment must fit the crime and that
the problem is overwhelmingly one of ignorance, not malfeasance. The BSA may or may not get its wish in getting new fines in place, but, unfortunately, any progress that the BSA makes may in effect be largely in vain because enforcing provisions of any laws and regulations concerning unlicensed software will remain as difficult as ever.
15. A push towards data encryption on mobile devices in US government

Prompted by numerous data security breaches within US departments and agencies, the US Office of Management and Budget (OMB) issued a directive requiring departments and agencies to encrypt data on mobile devices. The US Army is beginning to encrypt information on laptop computers and will soon require its personnel to be accountable for laptops and
other mobile devices. New Army policies will require an identification sticker on each laptop and mobile device. Additionally, equipment must be categorized as mobile or non-mobile and labeled as such, and mobile computing devices may not be removed from secure areas unless the information on them is properly encrypted and safeguarded. Nothing in this news item should be very surprising. After the many highly publicized data security breaches that have plagued US government departments and agencies, something had to be done. OMB’s having mandated encryption on mobile devices constituted a big step in the right direction. But as I have said before, encryption is no panacea. Problems such as lost, stolen and corrupted encryption keys and usability difficulties continue to plague the encryption arena. In a sense, OMB’s directive will simply swap one set of problems for another.
computers & security 25 (2006) 566–578
Tightening the net: A review of current and next generation spam filtering tools

James Carpinter, Ray Hunt*

Department of Computer Science and Software Engineering, University of Canterbury, Christchurch, New Zealand
article info

Article history: Received 16 December 2005; Revised 12 May 2006; Accepted 7 June 2006

Keywords: Spam; Ham; Heuristics; Machine learning; Non-machine learning; Bayesian filtering; Blacklisting

abstract

This paper provides an overview of current and potential future spam filtering approaches. We examine the problems spam introduces, what spam is and how we can measure it. The paper primarily focuses on automated, non-interactive filters, with a broad review ranging from commercial implementations to ideas confined to current research papers. Both machine learning- and non-machine learning-based filters are reviewed as potential solutions and a taxonomy of known approaches is presented. While a range of different techniques have been and continue to be evaluated in academic research, heuristic and Bayesian filtering dominate commercial filtering systems; therefore, a case study of these techniques is presented to demonstrate and evaluate their effectiveness.

© 2006 Elsevier Ltd. All rights reserved.
1. Introduction
The first message recognised as spam was sent to the users of Arpanet in 1978 and represented little more than an annoyance. Today, email is a fundamental tool for business communication and modern life, and spam represents a serious threat to user productivity and IT infrastructure worldwide. While it is difficult to quantify the level of spam currently sent, many reports suggest it represents substantially more than half of all emails sent and predict further growth for the foreseeable future (Espiner, 2005; Radicati Group, 2004; Zeller, 2005). For some, spam represents a minor irritant; for others, a major threat to productivity. According to a recent study by Stanford University (Nie et al., 2004), the average Internet user loses 10 working days each year dealing with incoming
spam. Costs beyond those incurred sorting legitimate (sometimes referred to as 'ham') emails from spam are also present: 15% of all emails contain some type of virus payload, and one in 3418 emails contains pornographic images particularly harmful to minors (Wagner, 2002). It is difficult to estimate the ultimate dollar cost of such expenses; however, most estimates place the worldwide cost of spam in 2005, in terms of lost productivity and IT infrastructure investment, at well over US$10 billion (Jennings, 2005; Spira, 2003).

Unfortunately, the underlying business model of bulk emailers (spammers) is simply too attractive. Commissions to spammers of 25–50% on products sold are not unusual (Zeller, 2005). On a collection of 200 million email addresses, a response rate of 0.001% would yield a spammer a return of $25,000, given a $50 product. Any solution to this problem must reduce the profitability of the underlying business model, by either substantially reducing the number of emails reaching valid recipients or increasing the expenses faced by the spammer.
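The arithmetic behind this example is worth making explicit. The following back-of-the-envelope sketch uses the figures quoted above; the $25,000 return implies the 25% commission at the low end of the cited range:

```python
# Back-of-the-envelope spam economics, using the figures quoted above.
addresses = 200_000_000      # size of the spammer's address list
response_rate = 0.001 / 100  # 0.001%, i.e. one sale per 100,000 emails
product_price = 50.0         # US$50 product
commission = 0.25            # 25% commission (low end of the 25-50% range)

sales = addresses * response_rate      # 2,000 sales
revenue = sales * product_price        # US$100,000 gross revenue
spammer_return = revenue * commission  # US$25,000 to the spammer

print(f"{sales:.0f} sales -> US${spammer_return:,.0f} to the spammer")
```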
* Corresponding author. Department of Computer Science and Software Engineering, Private Bag 4800, University of Canterbury, Christchurch, New Zealand. Tel.: +64 3 3642347; fax: +64 3 3642569. E-mail address: [email protected] (R. Hunt).
The similarities between junk postal mail and spam can be immediately recognised; however, the nature of the Internet has allowed spam to grow uncontrollably. Spam can be sent with no cost to the sender: the economic realities that regulate junk postal mail do not apply to the Internet. Furthermore, the legal remedies that can be taken against spammers are limited: it is not difficult to avoid leaving a trace, and spammers easily operate outside the jurisdiction of those countries with anti-spam legislation.

The magnitude of the problem has introduced a new dimension to the use of email: the spam filter. Such systems can be expensive to deploy and maintain, placing a further strain on IT budgets. While the reduced flow of spam email into a user's inbox is generally welcomed, the existence of false positives often necessitates the user manually double-checking filtered messages; this reality somewhat counteracts the assistance the filter delivers. The effectiveness of spam filters in improving user productivity is ultimately limited by the extent to which users must manually review filtered messages for false positives. Regrettably, no solution has yet been found to this vexing problem.

The classification task is complex and constantly changing. Constructing a single model to classify the broad range of spam types is difficult; this task is made near impossible with the realisation that spam types are constantly moving and evolving. Furthermore, most users find false positives unacceptable. The active evolution of spam can be partially attributed to changing tastes and trends in the marketplace; however, spammers often actively tailor their messages to avoid detection, adding a further impediment to accurate detection.

The remainder of this section provides supporting material on the topic of spam. Section 2 provides an overview of spam classification techniques. Sections 3.1 and 3.2 provide a more detailed discussion of some of the known spam filtering techniques: given the rapidly evolving nature of this field, it should be considered a snapshot of the critical areas of current research. Section 4 details the evaluation of spam filters, including a case study of the PreciseMail Anti-Spam system operating at the University of Canterbury. Section 5 finishes the paper with some conclusions on the state of this research area.
1.1. Definition
Spam is briefly defined by the TREC 2005 Spam Track as "unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient" (Cormack and Lynam, 2005b). The key elements of this definition are expanded on in a more extensive definition provided by Mail Abuse Prevention Systems (2004), which specifies three requirements for a message to be classified as spam. Firstly, the message must be equally applicable to many other potential recipients (i.e. the identity of the recipient and the context of the message are irrelevant). Secondly, the recipient has not granted 'deliberate, explicit and still-revocable permission for it to be sent'. Finally, the
communication of the message gives a 'disproportionate benefit' to the sender, as solely determined by the recipient. Critically, they note that simple personalisation does not make the identity of the recipient relevant and that failure by the user to explicitly opt out during a registration process does not constitute consent. Both these definitions identify the predominant characteristic of spam email: that a user receives unsolicited email that has been sent without any concern for their identity.
1.2. Solution strategies
Proposed solutions to spam can be separated into three broad categories: legislation, protocol change and filtering.

A number of governments have enacted legislation prohibiting the sending of spam email, including the USA (CAN-SPAM Act 2004) and the EU (Directive 2002/58/EC). American legislation requires an 'opt-out' list that bulk mailers are required to provide; this is arguably less effective than the European (and Australian) approach of requiring explicit 'opt-in' requests from consumers wanting to receive such emails. At present, legislation appears to have had little effect on spam volumes, with some arguing that the law has contributed to an increase in spam by giving bulk advertisers permission to send spam, as long as certain rules are followed.

Many proposals to change the way in which we send email have been put forward, including the required authentication of all senders, a per-email charge and a method of encapsulating policy within the email address (Ioannidis, 2003). Such proposals, while often providing a near complete solution, generally fail to gain support given the scope of a major upgrade or replacement of existing email protocols.

Interactive filters, often referred to as 'challenge–response' (C/R) systems, intercept incoming emails from unknown senders or those suspected of being spam. These messages are held by the recipient's email server, which issues a simple challenge to the sender to establish that the email came from a human sender rather than a bulk mailer. The underlying belief is that spammers will be uninterested in completing the 'challenge' given the huge volume of messages they send; furthermore, if a fake sender address is used, they will not receive the challenge. Selective C/R systems issue a challenge only when the (non-interactive) spam filter is unable to determine the class of a message. Challenge–response systems do slow down the delivery of messages, and many people refuse to use them.¹

Non-interactive filters classify emails without human interaction (such as that seen in C/R systems). Such filters often permit user interaction with the filter to customise user-specific options or to correct filter misclassifications; however, no human element is required during the initial classification decision. Such systems represent the most common solution to the spam problem, precisely because of their capacity to execute their task without supervision and without requiring a fundamental change in underlying email protocols.

¹ A cynical consideration of this approach may conclude that the recipient considers their time to be of more value than the sender's.
1.3. Statistical evaluation
Common experimental measures include spam recall (SR), spam precision (SP), F1 and accuracy (A) (see Fig. 1 for formal definitions of these measures). Spam recall is effectively spam accuracy. A legitimate email classified as spam is considered to be a 'false positive'; conversely, a spam message classified as legitimate is considered to be a 'false negative'.

The accuracy measure, while often quoted by product vendors, is generally not useful when evaluating anti-spam solutions. The level of misclassifications (1 − A) consists of both false positives and false negatives; clearly a 99% accuracy rate with 1% false negatives (and no false positives) is preferable to the same level of accuracy with 1% false positives (and no false negatives). The level of false positives and false negatives is of more interest than total system accuracy. Furthermore, accuracy can be severely distorted by the composition of the corpus; clearly, if the false positive and negative rates differ, overall accuracy will largely be determined by the ratio of legitimate email to spam.

A clear trade-off exists between false positive and false negative statistics: reducing false positives often means letting more spam through the filter. Therefore, the reported level of either statistic will be significantly affected by the classification threshold employed during the evaluation. False positives are regarded as having a greater cost than false negatives; cost-sensitive evaluation can be used to reflect this difference. This imbalance is reflected in the λ term: misclassification of a legitimate email as spam is considered to be λ times as costly as misclassifying a spam email as legitimate. λ values of 1, 9 and 999 are often used (Sakkis et al., 2001a; María and Hidalgo, 2002) to represent the cost differential between false positives and false negatives; however, no evidence exists (María and Hidalgo, 2002) to support the assumption that a false positive is 9 or 999 times more costly than a false negative. The value of λ is difficult to quantify, as it depends largely on the likelihood of a user noticing a misclassification and on the importance of the email in question. The definition and measurement of this cost imbalance (λ) is an open research problem.

The recall measure (see Fig. 1) defines the number of relevant documents identified as a percentage of all relevant documents; this measures a spam filter's ability to accurately identify spam (as 1 − SR is the false negative rate). The precision measure defines the number of relevant documents identified as a percentage of all documents identified; this shows the noise that the filter presents to the user (i.e. how many of the messages classified as spam will actually be spam). A trade-off, similar to that between false positives and negatives, exists between recall and precision.
Fig. 1 – Common experimental measures for the evaluation of spam filters.
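Fig. 1 itself is not reproduced here. For reference, the standard formal definitions these measures take in this literature (a reconstruction, not a copy of the figure) are given below, where $n_{X \to Y}$ counts the class-$X$ messages assigned to class $Y$, with $S$ denoting spam and $L$ legitimate email:

```latex
\begin{align*}
  SR &= \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} && \text{(spam recall)}\\[4pt]
  SP &= \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}} && \text{(spam precision)}\\[4pt]
  F1 &= \frac{2 \cdot SP \cdot SR}{SP + SR} && \text{(harmonic mean of SP and SR)}\\[4pt]
  A  &= \frac{n_{S \to S} + n_{L \to L}}
             {n_{S \to S} + n_{S \to L} + n_{L \to S} + n_{L \to L}} && \text{(accuracy)}
\end{align*}
```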
F1 is the harmonic mean of the recall and precision measures and combines both into a single measure. As an alternative measure, María and Hidalgo (2002) suggest ROC curves (receiver operating characteristics) (see Fig. 2 for an example). The curve shows the trade-off between true positives and false positives as the classification threshold parameter within the filter is varied. If the curve corresponding to one filter is uniformly above that corresponding to another, it is reasonable to infer that its performance exceeds that of the other for any combination of evaluation weights and external factors (Cormack and Lynam, 2004); the performance differential can be quantified using the area under the ROC curves. The area represents the probability that a randomly selected spam message will receive a higher 'score' than a randomly selected legitimate email message, where the 'score' is an indication of the likelihood that the message is spam.
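This probabilistic interpretation of the area under the curve can be computed directly from filter scores by comparing all spam/legitimate pairs; a minimal sketch follows (the scores are invented for illustration):

```python
from itertools import product

def auc_by_pairs(spam_scores, ham_scores):
    """AUC = P(random spam scores higher than random ham); ties count half."""
    wins = sum(1.0 if s > h else 0.5 if s == h else 0.0
               for s, h in product(spam_scores, ham_scores))
    return wins / (len(spam_scores) * len(ham_scores))

# Illustrative filter scores (higher = more spam-like).
spam = [0.95, 0.80, 0.70, 0.60]
ham = [0.40, 0.30, 0.20, 0.65]
print(auc_by_pairs(spam, ham))  # 0.9375 for these made-up scores
```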
2. Overview
Filter classification strategies can be broadly separated into two categories: those based on machine learning (ML) principles and those not based on ML (see Fig. 3). Non-machine learning techniques, such as heuristics, blacklisting and signatures, have been complemented in recent years with new ML-based technologies. In the last 3–4 years, a substantial academic research effort has taken place to evaluate new ML-based approaches to filtering spam; however, this work is ongoing.

ML filtering techniques can be further categorised (see Fig. 3) into complete and complementary solutions. Complementary solutions are designed to work as a component of a larger filtering system, offering support to the primary filter (whether it be ML or non-ML based). Complete solutions aim to construct a comprehensive knowledge base that allows them to classify all incoming messages independently. Such complete solutions come in a variety of flavours: some aim to build a unified model, some compare incoming email to previous examples (previous likeness), while others use a collaborative approach, combining multiple classifiers to evaluate email (ensemble).

Filtering solutions operate at one of two levels: at the mail server or as part of the user's mail program (see Fig. 4). Server-level filters examine the complete incoming email stream, and filter it based on a universal rule set for all users. Advantages of such an approach include centralised administration and maintenance, limited demands on the end user, and the ability to reject or discard email before it reaches the destination. User-level filters are based on a user's terminal, filtering incoming email from the network mail server as it arrives; they often form part of a user's email program. ML-based solutions often work best when placed at the user level (Garcia et al., 2004), as the user is able to correct misclassifications and adjust rule sets.

Spam filtering systems can be operated either on-site or off-site. On-site solutions can give local IT administrators greater control and more customisation options, in addition to relieving any security worries about redirecting email off-site for filtering.
Fig. 2 – An example of an ROC curve. In this fictional experiment, filter-2 (the upper line) is clearly superior as it has a higher level of true positives at every level of false positives.
According to Cain (2003), of the META Group, it is likely that on-site solutions are cheaper than their service (off-site) counterparts. He estimates on-premises solutions have a cost of US$6–12 per user (based on one gateway server and 10,000 users), compared to a cost of US$12–24 per user for a similar hosted (off-site) solution.

On-site filtering can take place at both the hardware and software levels. Software-based filters comprise many commercial and most open source products, which can operate at either the server or user level. Many software implementations will operate on a variety of hardware and software combinations (Schneider, 2004). Appliance (hardware-based) on-site solutions use a piece of hardware dedicated to email filtering. These are generally quicker to deploy than a similar software-based solution, given that the device is likely to be transparent to network traffic (Nutter, 2004). The appliance is likely to contain optimised hardware for spam filtering, leading to potentially better performance than a general-purpose machine running a software-based solution. Furthermore, general-purpose platforms, and in particular their operating systems, may have inherent security vulnerabilities: appliances may have pre-hardened operating systems (Chiu, 2004).

Off-site solutions (services) are based on the subscribing organisation redirecting its MX records² to the off-site vendor, who then filters the incoming email stream before redirecting the email back to the subscriber (Postini Inc., 2004). Theoretically, spam email will never enter the subscriber's network. Given that the organisation's email traffic will flow through external data centres, this raises some security issues: some vendors will only process incoming email in memory, while others will store it to disk (Cain, 2003). While plain-text email sent over the Internet can hardly be considered secure, any organisation that employs an off-site solution is likely to have a vested interest in ensuring a large repository of its email is not retained over time and that only authorised personnel have access to what could effectively be used as an electronic wiretap on its email communications. Initial setup of an off-site filter is substantially quicker: it can be operational within a week, while similar software solutions can take IT staff between 4 and 8 weeks to install, tune and test (Cain, 2003). Off-site solutions require only a supervisory presence from local IT staff and no upfront hardware or software investments, in exchange for a monthly fee.

² Mail exchange records are found in a domain name database and specify the email server used for handling emails addressed to that domain.
3. Filter technologies

3.1. Non-machine learning

3.1.1. Heuristics
Heuristic, or rule-based, analysis uses regular expression rules to detect phrases or characteristics that are common to spam; the quantity and seriousness of the spam features identified will suggest the appropriate classification for the message. The historical and current popularity of this technology has largely been driven by its simplicity, speed and consistent accuracy. Furthermore, it is superior to many advanced filtering technologies in the sense that it does not require a training period.

A simple heuristic filtering system may assign an email a score based upon the number of rules it matches. If an email's score is higher than a pre-defined threshold, the email will be classified as spam. For example, a rule may award a score of 5.0 if the word 'viagra' is included in the subject and another rule may award a score of 3.0 if 'refinance your home' is in the subject. An email with the text 'viagra refinance your home' in the subject line would receive a score of 8.0, assuming no other rules were in place; if this value was higher than the threshold, the email would be classed as spam.
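A minimal sketch of the scoring scheme just described follows; the two rules are the illustrative ones from the text, and the threshold of 7.0 is an assumption for the example, not a value from any real product:

```python
import re

# Illustrative rules from the example above: (pattern, score).
RULES = [
    (re.compile(r"viagra", re.IGNORECASE), 5.0),
    (re.compile(r"refinance your home", re.IGNORECASE), 3.0),
]
THRESHOLD = 7.0  # assumed threshold; real filters tune this value

def heuristic_score(subject: str) -> float:
    """Sum the scores of all rules whose pattern matches the subject."""
    return sum(score for pattern, score in RULES if pattern.search(subject))

def is_spam(subject: str) -> bool:
    return heuristic_score(subject) > THRESHOLD

print(is_spam("viagra refinance your home"))  # True: score 8.0 > 7.0
print(is_spam("refinance your home"))         # False: score 3.0
```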
Fig. 3 – Classification of the various approaches to spam filtering detailed in Section 2.
However, in light of new filtering technologies, heuristic filtering has several drawbacks. It is based on a static rule set: the system cannot adapt the filter to identify emerging spam characteristics. This requires the administrator to construct new detection heuristics or regularly download new generic rule files. Moreover, the rule set used by a particular product will be well known: it will be largely identical across all installation sites. Therefore, if a spammer can craft a message to penetrate the filter of a particular vendor, their messages will pass unhindered to all mail servers using that particular filter. Open source heuristic filters provide both the filter and the rule set for download, allowing the spammer to test their message for its penetration ability. Graham (2002) acknowledges the potentially high levels of accuracy achievable by heuristic filters, but believes that as they are tuned to achieve near 100% accuracy, an unacceptable level of false positives will result. This prompted his investigation of Bayesian filtering (see Sections 3.2.1 and 4.2).
3.1.2. Signatures
Signature-based techniques generate a unique hash value (signature) for each known spam message. Signature filters compare the hash value of an incoming email against all stored hash values of previously identified spam emails to classify the email. Signature generation techniques make it
statistically improbable for a legitimate email message to have the same hash as a spam message. This allows signature filters to achieve a very low level of false positives. However, signature-based filters are unable to identify spam emails until the email has been reported as spam and its hash distributed. Furthermore, if the signature distribution network is disabled, local filters will be unable to catch newly created spam messages.

Simple signature matching filters are trivial for spammers to work around: by inserting a string of random characters in each spam message sent, the hash value of each message will be changed. This has led to new, advanced hashing techniques, which can continue to match spam messages that have minor changes aimed at disguising the message. Spammers do have a window of opportunity to promote their messages before a signature is created and propagated amongst users. Furthermore, for the signature filter to remain efficient, the database of spam hashes has to be properly managed; the most common technique is to remove older hashes (Process Software, 2004). Once the spammer's message hash has been removed from the network, they can resume sending their message; it is unclear to what extent this is a problem, given the continuing evolution of spam and the ease with which a new spam message could be generated.

Commercial signature filters typically integrate with the organisation's mail server and communicate with a centralised signature distribution server to receive and submit spam email signatures (e.g. Cloudmark).
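A minimal sketch of the basic mechanism follows; real systems use fuzzy, obfuscation-resistant digests rather than this naive exact hash of the normalised body, and the final example shows exactly why:

```python
import hashlib

known_spam_signatures = set()  # in practice, fed by a distribution network

def signature(body: str) -> str:
    """Naive signature: SHA-256 of the whitespace-normalised message body."""
    normalised = " ".join(body.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def report_spam(body: str) -> None:
    known_spam_signatures.add(signature(body))

def is_known_spam(body: str) -> bool:
    return signature(body) in known_spam_signatures

report_spam("Cheap meds!!  Buy now")
print(is_known_spam("cheap meds!! buy now"))    # True: normalises identically
print(is_known_spam("Cheap m.e.d.s!! Buy now")) # False: trivial obfuscation defeats exact hashing
```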
Fig. 4 – Potential locations for spam filters to be positioned.
Distributed and collaborative signature filters also exist (e.g. Vipul's Razor). Such filters require sophisticated trust safeguards to prohibit the network's penetration and destruction by a malicious spammer while still allowing users to contribute spam signatures.

Yoshida et al. (2004) use a combination of hashing and document space density to identify spam. Substrings of length L are extracted from the email, and hash values are generated for each; the first N hash values form a vector representation of the email. This allows similar emails to be identified and their frequency recorded. Given the high volumes of email spammers are required to send to generate a worthwhile economic benefit, there is a heavy maldistribution of spam email traffic, which allows for easy identification. Document space density is therefore used to separate spam from legitimate email; when this method is combined with a short whitelist for solicited mass email, the authors report results of 98% recall and 100% precision over 50 million actual pieces of email traffic.

Damiani et al. (2004a) use message digests, the addresses of originating mail servers and URLs within the message to identify spam mail. Each message maps to a 256-bit digest, and is considered the same as another message if it differs by at most 74 bits. Previous work (Damiani et al., 2004b) identified that this approach is robust against attempts to disguise the message. This email identification approach is then implemented within a P2P (peer-to-peer) architecture. Similarly, Gray and Haahr (2004) present the CASSANDRA architecture for a personalised, collaborative spam filtering system, using signature-based filtering technology and a P2P distribution network.
3.1.3. Blacklisting

Blacklisting is a simplistic technique that is common within nearly all filtering products. Also known as block lists, blacklists filter out emails received from a specific sender. Whitelists, or allow lists, perform the opposite function, automatically allowing emails from a specific sender. Such lists can be implemented at the user or at the server level, and represent a simple way to resolve minor imperfections created by other filtering techniques, without drastically overhauling the filter.

Given the simplistic nature of the technology, it is unsurprising that it can be easily penetrated. The sender's email address within an email can be faked, allowing spammers to easily bypass blacklists by inserting a different (fake) sender address with each bulk mailing. Correspondingly, whitelists can also be targeted by spammers. By predicting likely whitelisted addresses (e.g. all internal email addresses, or your boss's email address), spammers can penetrate other filtering solutions in place by appropriately forging the sender address.

DNS blacklisting operates on the same principles, but maintains a substantially larger database. When an SMTP session is started with the local mail server, the foreign host's address is compared against a list of networks and/or servers known to allow the distribution of spam. If a match is recorded, the session is immediately closed, preventing the delivery of the spam message. This filtering approach is highly effective at discarding substantial amounts of spam email, yet has low system requirements, and enabling it often requires only minimal changes to the mail server and filtering solution.

However, such lists often have a notoriously high rate of false positives, making them "dangerous" to use as a standalone filtering system (Snyder, 2004). Once blacklisted, spammers can cheaply acquire new addresses. Often several people must complain before an address is blacklisted; by the time the list is updated and distributed, the spammer can often send millions of spam messages. Spammers can also masquerade as legitimate sites. Their motivation here is twofold: either they will escape being blacklisted or they will cause a legitimate site to be blacklisted (reducing the accuracy, and therefore the attractiveness, of the DNS blacklist) (Process Software, 2004). Several filters now use such lists as part of a complete filtering solution, weighting information provided by the DNS blacklist and incorporating it into results provided by other techniques to produce a final classification decision.
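The query mechanics behind a DNS blacklist are simple: the octets of the connecting host's IPv4 address are reversed, the blacklist zone is appended, and a successful A-record lookup means the host is listed. A minimal sketch follows; the mechanism is the standard one, but the zone name and test address are illustrative:

```python
import socket

def is_dnsbl_listed(ip: str, zone: str = "sbl-xbl.spamhaus.org") -> bool:
    """Standard DNSBL check: reverse the IPv4 octets, append the zone,
    and treat a successful A-record lookup as 'listed'."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)  # e.g. 2.0.0.127.sbl-xbl.spamhaus.org
        return True
    except socket.gaierror:
        return False  # NXDOMAIN: address not listed

# 127.0.0.2 is the conventional test address most DNSBLs always list.
print(is_dnsbl_listed("127.0.0.2"))
```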
3.1.4. Traffic analysis

Gomes et al. (2004) provide a characterisation of spam traffic patterns. By examining a number of email attributes, they are able to identify characteristics that separate spam traffic from non-spam traffic. Several key workload aspects differentiate spam traffic, including the email arrival process, email size, number of recipients per email, and popularity and temporal locality among recipients. An underlying difference in purpose gives rise to these differences in traffic: legitimate mail is used to interact and socialise, whereas spam is typically generated by automatic tools to contact many potential, mostly unknown, users. They consider their research the first step towards defining a spam signature for the construction of an advanced spam detection tool.
3.2. Machine learning

3.2.1. Unified model filters
Bayesian filtering now commonly forms a key part of many enterprise-scale filtering solutions. No other machine learning or statistical filtering technique has achieved such widespread implementation; it therefore represents the 'state-of-the-art' approach in industry. It addresses many of the shortcomings of heuristic filtering. It uses a rule set that is unknown to the sender: the tokens and their associated probabilities are manipulated according to the user's classification decisions and the types of email received. Therefore each user's filter will classify emails differently, making it impossible for a spammer to craft a message that bypasses a particular brand of filter. The rule set is also adaptive: Bayesian filters can adapt their concepts of legitimate and spam emails, based on user feedback, which continually improves filter accuracy and allows detection of new spam types.

Bayesian filters maintain two tables: one of spam tokens and one of 'ham' (legitimate) mail tokens. Associated with each spam token is a probability that the token suggests the email is spam, and likewise for ham tokens. For example, Graham (2002) reports that the word 'sex' indicates a 0.97 probability that an email is spam. Probability values are initially established by training the filter to recognise spam and legitimate emails, and are then continually updated
(and created) based on the emails that the filter successfully classifies. Incoming email is tokenised on arrival, and each token is matched with its probability value from the user's records. The probabilities associated with the tokens are then combined, using Bayes' rule, to produce an overall probability that the email is spam. An example is provided in Fig. 5.

Bayesian filters perform best when they operate at the user level, rather than at the network mail server level. Each user's email and definition of spam differ; therefore a token database populated with user-specific data will result in more accurate filtering (Garcia et al., 2004).

Bayes' formula was initially applied to spam filtering in 1998 by Sahami et al. (1998) and Pantel and Lin (1998). Graham (2002, 2003) later implemented a Bayesian filter that caught 99.5% of spam with 0.03% false positives. Androutsopoulos et al. (2000b) established that a naive Bayesian filter clearly surpasses keyword-based filtering, even with a very small training corpus. More recently, Zdziarski (2004) has introduced Bayesian noise reduction as a way of increasing the quality of the data provided to a naive Bayes classifier: it removes irrelevant text to provide more accurate classification by identifying patterns of text that are commonplace for the user. Given the high levels of accuracy that a Bayesian filter can potentially provide, it has unsurprisingly emerged as a standard against which new filtering technologies are evaluated.

Despite such prominence, few commercial Bayesian filters are fully consistent with Bayes' rule, creating their own artificial scoring systems rather than relying on the raw probabilities generated (Vaughan-Nichols, 2003). Furthermore, filters generally use 'naive' Bayesian filtering, which assumes that the occurrence of events is independent of each other; i.e. such filters do not consider that the words 'special' and 'offers' are more likely to appear together in spam email than in legitimate email.

In an attempt to address this limitation of standard Bayesian filters, Yerazunis et al. (2003, 2004) introduced sparse binary polynomial hashing (SBPH) and orthogonal sparse bigrams (OSB). SBPH is a generalisation of the naive Bayesian filtering method, with the ability to recognise mutating phrases in addition to individual words or tokens, and uses the Bayesian chain rule to combine the individual feature conditional probabilities. Yerazunis et al. reported results that exceed
99.9% accuracy on real-time email without the use of whitelists or blacklists. An acknowledged limitation of SBPH is that the method may be too computationally expensive; OSB generates a smaller feature set than SBPH, decreasing memory requirements and increasing speed. A filter based on OSB, along with the non-probabilistic Winnow algorithm as a replacement for the Bayesian chain rule, saw accuracy peak at 99.68%, outperforming SBPH by 0.04%; however, OSB used just 600,000 features, substantially fewer than the 1,600,000 features required by SBPH.

Support vector machines (SVMs) are generated by mapping training data in a nonlinear manner to a higher-dimensional feature space, where a hyperplane is constructed which maximises the margin between the sets. The hyperplane is then used as a nonlinear decision boundary when exposed to real-world data. Drucker et al. (1999) applied the technique to spam filtering, testing it against three other text classification algorithms: Ripper, Rocchio and boosting decision trees. Both boosting trees and SVMs provided "acceptable" performance, with SVMs preferable given their lesser training requirements. An SVM-based filter for Microsoft Outlook has also been tested and evaluated (Woitaszek et al., 2003). Rios and Zha (2004) also experiment with SVMs, along with random forests (RFs) and naive Bayesian filters. They conclude that SVM and RF classifiers are comparable, with the RF classifier more robust at low false positive rates; both outperform the naive Bayesian classifier.

While chi by degrees of freedom had previously been used in authorship identification, it was first applied to spam filtering by O'Brien and Vogel (2003). Ludlow (2002) concluded that tens of millions of spam emails may be attributable to just 150 spammers; authorship identification techniques should therefore be able to identify the textual fingerprints of this small group, allowing a significant proportion of spam to be effectively filtered. This technique, when compared with a Bayesian filter, was found to provide equally good or better results.

Clark et al. (2003) construct a backpropagation-trained artificial neural network (ANN) classifier named LINGER. ANNs require a relatively substantial amount of time for parameter selection and training, when compared against other previously evaluated methods. The classifier can go beyond the standard spam/legitimate email decision, instead classifying incoming emails into an arbitrary number of folders.
Fig. 5 – A simple example of Bayesian filtering.
LINGER outperformed naive Bayesian, k-NN, stacking, stumps and boosted-tree filtering techniques, based on the reported results, recording perfect results (across many measures) on all tested corpora, for all λ. LINGER also performed well when feature selection was based on a different corpus from the one on which it was trained and tested.

Chhabra et al. (2004) present a spam classifier based on a Markov random field (MRF) model. This approach allows the spam classifier to consider the importance of the neighbourhood relationship between words in an email message (MRF cliques). The inter-word dependence of natural language can therefore be incorporated into the classification process; this is normally ignored by naive Bayesian classifiers. Characteristics of incoming emails are decomposed into feature vectors and are weighted in a superincreasing manner, reflective of inter-word dependence. Several weighting schemes are considered, each of which evaluates increasingly long matches differently. Accuracy over 5000 test messages is shown to be superior to that of a naive Bayesian-equivalent classifier (97.98% accurate), with accuracy reaching 98.88% with a window size (i.e. maximum phrase length) of five and an exponentially superincreasing weighting model.
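As a concrete illustration of the token-combination step described at the start of this section (cf. Fig. 5, which is not reproduced here), the following minimal sketch applies the Graham-style naive Bayesian formula P = Πpᵢ / (Πpᵢ + Π(1 − pᵢ)); the token table is invented for illustration, apart from the 0.97 value for 'sex' quoted above:

```python
from math import prod

# Illustrative per-token spam probabilities from a trained token database.
token_prob = {"sex": 0.97, "viagra": 0.99, "meeting": 0.05, "project": 0.10}
PRIOR_UNKNOWN = 0.4  # Graham's assumption for previously unseen tokens

def spam_probability(tokens):
    """Combine per-token probabilities with the naive Bayesian formula
    P = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    probs = [token_prob.get(t.lower(), PRIOR_UNKNOWN) for t in tokens]
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)

print(spam_probability("viagra sex".split()))       # ~0.9997: classed as spam
print(spam_probability("project meeting".split()))  # ~0.006: classed as ham
```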
3.2.2. Previous likeness based filters
Memory-based, or instance-based, machine learning techniques classify incoming emails according to their similarity to stored examples (i.e. training emails). Defined email attributes form a multi-dimensional space in which new instances are plotted as points. Each new instance is then assigned to the majority class of its k closest training instances, using the k-nearest-neighbour (k-NN) algorithm. Sakkis et al. (2000, 2001) use a k-NN spam classifier, implemented using the TiMBL memory-based learning software (Daelemans et al., 2000). The basic k-NN classifier was extended to weight attributes according to their importance, and to weight nearer neighbours with greater importance (distance weighting). The classifier was compared with a naive Bayesian classifier using cost-sensitive evaluation. The memory-based classifier compares "favourably" to the naive Bayesian approach, with spam recall improving at all levels (1, 9, 999) of λ, at a small cost in precision at λ = 1, 9. The authors conclude that this is a "promising" approach, with a number of research possibilities to explore.

Case-based reasoning (CBR) systems maintain their knowledge in a collection of previously classified cases, rather than in a set of rules. Incoming email is matched against similar cases in the system's collection, which provide guidance towards the correct classification of the email. The final classification, along with the email itself, then forms part of the system's collection for the classification of future emails. Cunningham et al. (2003) construct a case-based reasoning classifier that can track concept drift. They propose that the classifier both adds new cases to and removes old cases from the system collection, allowing the system to adapt to the drift of characteristics in both spam and legitimate mails. An initial evaluation of their classifier suggests that it outperforms naive Bayesian classification. This is unsurprising given that naive Bayesian filters attempt to learn a "unified spam concept" that will identify all spam emails, whereas spam email differs significantly depending on the product or service on offer.
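A minimal sketch of the k-NN classification step described above, with toy two-dimensional feature vectors standing in for real email attributes (real systems derive many more dimensions from token frequencies, header fields and the like):

```python
from collections import Counter

# Toy training instances: (feature vector, class).
training = [((0.9, 0.8), "spam"), ((0.8, 0.9), "spam"),
            ((0.1, 0.2), "ham"),  ((0.2, 0.1), "ham")]

def knn_classify(x, k=3):
    """Assign x to the majority class among its k nearest training points."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(training, key=lambda item: dist(item[0], x))[:k]
    return Counter(cls for _, cls in nearest).most_common(1)[0][0]

print(knn_classify((0.85, 0.7)))  # "spam": two of its three neighbours are spam
```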
Rigoutsos and Huynh (2004) apply the Teiresias pattern discovery algorithm to email classification. Given a large collection of spam emails, the algorithm identifies patterns that appear more than twice in the corpus. Negative training occurs by running the pattern identification algorithm over legitimate email; patterns common to both corpora are removed from the spam vocabulary. Successful classification relies on training the system based on a comprehensive and representative collection of spam and legitimate emails. Experimental results are based on a training corpus of 88,000 pieces of spam and legitimate emails. Spam precision was reported at 96.56%, with a false positive rate of 0.066%.
3.2.3. Ensemble filters
Stacked generalisation is a method of combining classifiers, resulting in a classifier ensemble. Incoming email messages are first given to the ensemble's component classifiers, whose individual decisions are combined to determine the class of the message. Improved performance is expected given that different ground-level classifiers generally make uncorrelated errors. Sakkis et al. (2001b) create an ensemble of two different classifiers: a naive Bayesian classifier (Androutsopoulos et al., 2000a,b) and a memory-based classifier (Sakkis et al., 2001a; Androutsopoulos et al., 2000c). Analysis of the two component classifiers indicated that they tend to make uncorrelated errors. Unsurprisingly, the stacked classifier outperforms both of its component classifiers on a variety of measures.

The boosting process combines many moderately accurate weak rules (decision stumps) to induce one accurate, arbitrarily deep, decision tree. Carreras and Márquez (2001) use the AdaBoost boosting algorithm and compare its performance against spam classifiers based on decision trees, naive Bayesian and k-NN methods. They conclude that their boosting-based methods outperform standard decision trees, naive Bayes, k-NN and stacking, with their classifier reporting F1 rates above 97% (see Section 1.3). The AdaBoost algorithm provides a measure of confidence with its predictions, allowing the classification threshold to be varied to provide a very high precision classifier.
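A minimal sketch of the ensemble idea: two stand-in ground-level classifiers emit spam scores, and a second-level combiner produces the final decision. In true stacked generalisation the combiner is itself trained on held-out component predictions; fixed weights are used here only to keep the sketch short:

```python
def bayes_score(email: str) -> float:   # stand-in for a naive Bayes component
    return 0.9 if "viagra" in email else 0.2

def memory_score(email: str) -> float:  # stand-in for a memory-based component
    return 0.8 if "offer" in email else 0.3

def stacked_classify(email: str, weights=(0.6, 0.4), threshold=0.5) -> str:
    """Second-level combiner: weighted vote over component scores.
    A trained meta-classifier would replace these fixed weights."""
    combined = weights[0] * bayes_score(email) + weights[1] * memory_score(email)
    return "spam" if combined > threshold else "ham"

print(stacked_classify("special offer on viagra"))     # spam (0.86 > 0.5)
print(stacked_classify("minutes of today's meeting"))  # ham  (0.24 < 0.5)
```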
3.2.4. Complementary filters
Adaptive spam filtering (Pelletier et al., 2004) targets spam by category and is proposed as an additional spam filtering layer. It divides an email corpus into several categories, each with a representative text. Incoming email is then compared with each category, and a resemblance ratio is generated to determine the likely class of the email. When combined with Spamihilator, the adaptive filter caught 60% of the spam that passed through Spamihilator's keyword filter.

Boykin and Roychowdhury (2005) identify a user's trusted network of correspondents with an automated graph method to distinguish between legitimate and spam emails. The classifier was able to determine the class of 53% of all emails evaluated, with 100% accuracy. The authors intend this filter to be part of a more comprehensive filtering system, with a content-based filter responsible for classifying the remaining messages. Golbeck and Hendler (2004) constructed a similar network from 'trust' scores, assigned by users to people they know. Trust ratings can then be inferred for unknown users, provided the users are connected via mutual acquaintances.
Content-based email filters work best when words inside the email text are lexically correct; i.e. most will rapidly learn that the word 'viagra' is a strong indicator of spam, but may not draw the same conclusions from the word 'V.i-a.g*r.a'. Assuming the spammer continues to use the obfuscated word, the content-based filter will learn to identify it as spam; however, given the number of possibilities available to disguise a word, most standard filters will be unable to detect these terms in a reasonable amount of time. Lee and Ng (2005) use a hidden Markov model to deobfuscate text. Their model is robust to many types of obfuscation, including substitutions and insertions of non-alphabetic characters, straightforward misspellings and the addition and removal of unnecessary spaces. When exposed to 60 obfuscated variants of 'viagra', their model successfully deobfuscated 59, and recorded an overall deobfuscation accuracy of 94% (across all test data).

Spammers typically use purpose-built applications to distribute their spam (Hunt and Cournane, 2004). Greylisting tries to deter spam by rejecting email from unfamiliar IP addresses with a soft fail (i.e. a 4xx response). It is built on the premise that so-called 'spamware' (Levine, 2005) does little or no error recovery, and will not retry sending the message. Any correct client should retry; however, some do not (due either to a bug or to policy), so there is the potential to lose legitimate email. Legitimate email can also be unnecessarily delayed; this is mitigated by automatically whitelisting source IP addresses after they have successfully retried once. In an analysis performed by Levine (2005) over a 7-week period (covering 715,000 delivery attempts), 20% of attempts were greylisted; of those, only 16% retried. Careful system design can minimise the potential for lost legitimate email; certainly greylisting is an effective technique for rejecting spam generated by poorly implemented spamware.

SMTP path analysis (Leiba et al., 2005) learns the reputation of IP addresses and email domains by examining the paths used to transmit known legitimate and spam emails. It uses the 'received' line that the SMTP protocol requires each SMTP relay to add to the top of each email processed, which details the relay's identity, the processing timestamp and the source of the message. Despite the fact that these headers can easily be spoofed, when this technique operates in combination with a Bayesian filter, overall accuracy is approximately doubled.
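A minimal sketch of the greylisting logic described above, keyed on the conventional (client IP, sender, recipient) triplet; the retry window and response strings are assumptions, as implementations vary:

```python
import time

seen_triplets = {}   # (ip, sender, recipient) -> time of first attempt
whitelisted = set()  # triplets (or source IPs) that have retried successfully
RETRY_DELAY = 300    # assumed minimum delay (seconds) before a retry counts

def smtp_response(ip, sender, recipient):
    """Return an SMTP-style response: soft-fail unknown triplets once."""
    triplet = (ip, sender, recipient)
    if triplet in whitelisted:
        return "250 OK"
    first_seen = seen_triplets.setdefault(triplet, time.time())
    if time.time() - first_seen >= RETRY_DELAY:
        whitelisted.add(triplet)  # a correct client retried: accept from now on
        return "250 OK"
    return "450 Greylisted, try again later"  # spamware rarely retries
```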
4. Evaluation
4.1. Barriers to comparison
This paper outlines many new techniques researched to filter spam email. It is difficult to compare the reported results of classifiers presented in various research papers given that each author selects a different corpus of email for evaluation. A standard 'benchmark' corpus, comprising both spam and legitimate emails, is required in order to allow meaningful comparison of reported results of new spam filtering techniques against existing systems. However, this is far from being a straightforward task. Legitimate email is difficult to find: several publicly available repositories of spam exist (e.g. http://www.spamarchive.org); however, it is significantly more difficult to locate a similarly vast collection of legitimate emails, presumably due to privacy concerns. Spam is also constantly changing. Techniques used by spammers to communicate their message are continually evolving (Hunt and Cournane, 2004); this is also seen, to a lesser extent, in legitimate email. Therefore, any static spam corpus would, over time, no longer resemble the makeup of current spam email. Graham-Cumming (2005), maintainer of the Spammers' Compendium, has identified 18 new techniques used by spammers to disguise their messages between 14 July 2003 and 14 January 2005. A total of 45 techniques are currently listed (as of 11 December 2005). While the introduction of modern spam construction techniques will affect a spam filter's ability to detect the actual content of the message, it is important to note that most heuristic filter implementations are updated regularly, both in terms of the rule set and the underlying software.

Several alternatives to a standard corpus exist. SpamAssassin (http://spamassassin.apache.org) maintains a collection of legitimate and spam emails, categorised into easy and hard examples. However, the corpus is now more than 2 years old. Androutsopoulos et al. (2000a) have built the 'Ling-Spam' corpus, which imitates legitimate email by using the postings of the moderated 'Linguist' mailing list. The authors acknowledge that the messages may be more specialised in topic than those received by a standard user, but suggest that the corpus can be used as a reasonable substitute for legitimate email in preliminary testing. SpamArchive maintains an archive of spam contributed by users. Archives are created that contain all spam received by the archive on a particular day, providing researchers with an easily accessible collection of up-to-date spam emails. As a result of the Enron bankruptcy, 400 MB of realistic workplace email has become publicly available; it is likely that this will form part of future standard corpora, despite some outstanding issues (Cormack and Lynam, 2005a).

Building an artificial corpus or a corpus from presorted user email ensures the class of each message is known with certainty. However, when dealing with a public corpus (like the Enron corpus), it is more difficult to determine the actual class of a message for accurate evaluation of filter performance. Therefore, Cormack and Lynam (2005a) propose establishing a 'gold standard' for each message, which is considered to be the message's actual class. They use a bootstrap method based on several different classifiers to simplify the task of sorting through this massive collection of email; it remains a work in progress. Their filter evaluation toolkit, given a corpus and a filter, compares the filter classification of each message with the gold standard to report effectiveness measures with 95% confidence limits.

In order to compare different filtering techniques, a standard set of legitimate and spam emails must be used, both for the testing and the training (if applicable) of filters. Independent tests of filters are generally limited to usable commercial and open source products, excluding experimental classifiers appearing only in research. Experimental classifiers are generally only compared against standard techniques (e.g. Bayesian filtering) in order to establish their relative effectiveness; however, this makes it difficult to isolate the most
computers & security 25 (2006) 566–578
promising new techniques. NetworkWorldFusion (Snyder, 2004) reviews 41 commercial filtering solutions, while Cormack and Lynam (2004) review six open source filtering products.
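A simplified stand-in for this style of gold-standard evaluation can be sketched as follows (the real toolkit of Cormack and Lynam also attaches 95% confidence limits, which are omitted here):

    def evaluate(gold, verdicts):
        # gold, verdicts: parallel lists of 'spam' / 'ham' labels, where gold
        # holds each message's gold-standard class and verdicts the filter's output.
        fp = sum(1 for g, v in zip(gold, verdicts) if g == "ham" and v == "spam")
        fn = sum(1 for g, v in zip(gold, verdicts) if g == "spam" and v == "ham")
        ham, spam = gold.count("ham"), gold.count("spam")
        return {"fp_rate": fp / ham if ham else 0.0,
                "fn_rate": fn / spam if spam else 0.0,
                "accuracy": 1 - (fp + fn) / len(gold)}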
4.2. Case study
Throughout this paper we have discussed the advances made in spam filtering technology. In this section, we evaluate the extent to which users at the University of Canterbury could potentially benefit from these advances in filtering techniques. Furthermore, we hope to collect data to substantiate some recommendations for evaluating spam filters.

The University of Canterbury maintains a two-stage email filtering solution. A subscription DNS blacklisting system is used in conjunction with Process Software's PreciseMail Anti-Spam System (PMAS). The University of Canterbury receives approximately 110,000 emails per day, of which approximately 50,000 are eliminated by the DNS blacklisting system before delivery is complete. Of those emails that are successfully delivered, PMAS discards around 42% and quarantines around 35% for user review. In its standard state, PMAS filtering is based on a comprehensive heuristic rule collection, combined with both server-level and user-level block and allow lists. However, the software has a Bayesian filtering option that works in conjunction with the heuristic filter; this option was not active before the evaluation.

Two experiments were conducted. The first used the publicly available SpamAssassin corpus to provide a comparable evaluation of PMAS in terms of false positives and false negatives. This experiment aimed to evaluate the overall performance of the filter, as well as the relative performance of the heuristic and Bayesian components. The second used spam collected from the SpamArchive repository to evaluate false positive levels on spam collected at various points over the last 2 years. The aim of this experiment was to observe whether the age of spam has any effect on the effectiveness of the filter, as well as to compensate for the age of the SpamAssassin corpus.

The training of the PMAS Bayesian filter took place over 2 weeks. PMAS automatically (as recommended by the vendor) trains the Bayesian filter by showing it emails that score[3] above and below defined thresholds, as examples of spam and non-spam, respectively. The results of passing the partial SpamAssassin corpus through the PMAS filter can be seen in Fig. 6. The partial corpus has the 'hard' spam removed, which consists of emails with unusual HTML markup, coloured text, spam-like phrases, etc. The use of the full corpus increases false positives made by the overall filter from 1 to 4% of all legitimate messages filtered.

[3] Scores were generated by the heuristic filter.

The spam corpus drawn from the SpamArchive was constructed from the spam email submitted manually (by users) to SpamArchive on the 14th, 15th and 16th of each month used. These dates were randomly chosen. The total number of emails collected at each point varied from approximately 1700 to 3200. The performance of each filter (heuristic, Bayesian and combined) steadily declined over time as newer spam from the SpamAssassin corpus was introduced. It is assumed that spam more recently submitted to the archive would be more likely to employ newer message construction techniques. No effort has been made to individually examine the test corpus to identify these characteristics. Any person with an email account can submit spam to the archive: this should create a sufficiently diverse catchment base, ensuring a broad range of spam messages is archived. A broad corpus of spam should reflect, to some extent, new spam construction techniques. The fact that updates are regularly issued by major anti-spam product vendors indicates that such techniques are becoming widespread.

Overall results are consistent with those published by NetworkWorldFusion (Snyder, 2004): they recorded 0.75% false positives and 96% accuracy, while we recorded 0.75% false positives (with the partial SpamAssassin corpus) and 97.67% accuracy. Under both the full and partial SpamAssassin corpora, the combined filtering option surpasses the alternatives in the two key areas: a lower level of false positives, and a higher level of spam caught (i.e. discarded). This can be clearly seen in Fig. 6. In terms of these measures, the heuristic filter is closest to the performance of the combined filter. This is unsurprising given that the Bayesian component of the combined filter contributes relatively little and that it was initially trained by the heuristic filter. The Bayesian filter performs comparatively worse than the other two filtering options, as less email is correctly treated (i.e. spam discarded or ham forwarded) and notably more email is quarantined for user review. This is consistent with Garcia et al. (2004), who suggested such a filtering solution was best placed at the user level, rather than the server level.

The performance of the heuristic filter deteriorates as messages get more recent. This would suggest that the PMAS rule set and underlying software have greater difficulty in identifying a spam message when its content is deliberately obscured by advanced spam construction techniques. This is despite regular updates to the filter rule set and software. The combined filter performs similarly to the heuristic filter. This is unsurprising given that the heuristic filter contributes the majority of each message's score (which then determines the class of the message). The introduction of Bayesian filtering improved overall filter performance in all respects when dealing with both the SpamAssassin archive and the SpamArchive collections.

The results from the Bayesian filter are less obvious. One would expect the Bayesian filter to become more effective over time, given that it has been trained exclusively on more recent messages. In the broadest sense, this can be observed: the filter's performance improves by 7% on the January 2005 collection when compared against the July 2003 collection. However, the filter performs best on the 2004 collections (January and July). It is possible that this is due to the training of the Bayesian filter; the automated training performed by PMAS may have incorrectly added some tokens to the ham/spam databases. Furthermore, the spam received by the University of Canterbury may not reflect the spam received by the SpamArchive; this would therefore impact the training of the Bayesian filter.
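The vendor-recommended automatic training regime can be pictured as follows; the thresholds and the bayes.train interface are assumptions for illustration, not PMAS internals:

    SPAM_TRAIN_THRESHOLD = 20.0   # heuristic score above which a message is used as a spam example (assumed)
    HAM_TRAIN_THRESHOLD = 0.0     # heuristic score below which a message is used as a ham example (assumed)

    def auto_train(bayes, message, heuristic_score):
        # Train the Bayesian component only on messages that the heuristic
        # filter classifies with high confidence, as described above.
        if heuristic_score >= SPAM_TRAIN_THRESHOLD:
            bayes.train(message, is_spam=True)
        elif heuristic_score <= HAM_TRAIN_THRESHOLD:
            bayes.train(message, is_spam=False)

One consequence visible in the sketch is that the Bayesian model inherits the heuristic filter's judgements, which is consistent with the observation above that the combined filter tracks the heuristic filter closely.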
Fig. 6 – Performance of the PMAS filtering elements using the partial SpamAssassin public corpus. (Bar chart comparing the combined, Bayesian and heuristic filters on the percentage of spam and ham forwarded, quarantined and discarded.)
New spam construction techniques are likely to have contributed to the lower spam accuracy scores; heuristic filters seem especially vulnerable to these developments. It is reasonable to say that such techniques are effective: a regularly updated heuristic filter becomes less effective over time, which reinforces the need for a complementary machine learning approach when assembling a filtering solution.

Broadly, one can conclude two things from this experiment. Firstly, the use of a Bayesian filtering component improves overall filter performance; however, it is not a substitute for the traditional heuristic filter, but rather a complement (at least at the server level). Secondly, the concerns raised about the effects of time on the validity of the corpora seem to be justified: older spam does seem to be more readily identified, suggesting changing techniques.

It is interesting to note that, despite improved performance, the Bayesian filtering component was deactivated some months after the completion of this evaluation due to increasing CPU and memory demands on the mail filtering gateway. This can be primarily attributed to the growth of the internal token database, as the automatic training system remained active throughout the period; arguably, this could have been disabled once a reasonably sized database had been constructed, but doing so would have negated some of the benefits realised by a machine learning-based filtering system (such as an adaptive rule set). This is a weakness both of the implementation, as no mechanism was provided to reduce the database size, and of the Bayesian approach and unified-model machine learning approaches in general. When constructing a unified model, the text of each incoming message affects the current model; however, reversing these changes can be particularly difficult. In the case of a Bayesian filter, a copy of each message processed (or some kind of representative text) would be necessary to reverse the impact of past messages on the model.
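The reversal problem can be made concrete: a token-count database can only be 'untrained' if each message's tokens were retained, as this sketch (illustrative only, not any product's code) shows:

    from collections import Counter

    class BayesDB:
        def __init__(self):
            self.spam = Counter()   # token -> count in spam training data
            self.ham = Counter()    # token -> count in ham training data

        def train(self, tokens, is_spam):
            (self.spam if is_spam else self.ham).update(tokens)

        def untrain(self, tokens, is_spam):
            # Reversing training requires the original token multiset of that
            # message; without it, the contribution cannot be subtracted.
            db = self.spam if is_spam else self.ham
            db.subtract(tokens)
            # drop zero/negative entries so the database can actually shrink
            for t in [t for t, c in db.items() if c <= 0]:
                del db[t]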
5. Conclusion
Spam has the potential to become a very serious problem for the Internet community, threatening both the integrity of
networks and the productivity of users. Anti-spam vendors offer a wide array of products designed to keep spam out; these are implemented in various ways (software, hardware or service) and at various levels (server and user). The introduction of new technologies, such as Bayesian filtering, is improving filter accuracy; we have confirmed this for ourselves after examining the PreciseMail Anti-Spam system. The net is being tightened even further: a vast array of new techniques have been evaluated in academic papers, and some have been taken into the community at large via open source products. The implementation of machine learning algorithms is likely to represent the next step in the ongoing fight to reclaim our inboxes.
references
Androutsopoulos I, Koutsias J, Chandrinos K, Paliouras G, Spyropoulos C. An evaluation of naive Bayesian anti-spam filtering. In: Proceedings of the workshop on machine learning in the new information age; 2000a.
Androutsopoulos I, Koutsias J, Chandrinos K, Spyropoulos C. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: SIGIR '00: proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval. ACM Press; 2000b. p. 160–7.
Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos C, Stamatopoulos P. Learning to filter spam e-mail: a comparison of a naive Bayesian and a memory-based approach. In: Workshop on machine learning and textual information access, fourth European conference on principles and practice of knowledge discovery in databases (PKDD); 2000c.
Boykin PO, Roychowdhury V. Personal email networks: an effective anti-spam tool. In: MIT spam conference; Jan 2005.
Cain M. Spam blocking: what matters. META Group, http://www.postini.com/brochures; 2003.
Carreras X, Márquez L. Boosting trees for anti-spam email filtering. In: Proceedings of RANLP-01, fourth international conference on recent advances in natural language processing, Tzigov Chark, BG; 2001.
computers & security 25 (2006) 566–578
Chhabra S, Yerazunis W, Siefkes C. Spam filtering using a Markov random field model with variable weighting schemas. In: Fourth IEEE international conference on data mining; 1–4 Nov 2004. p. 347–50.
Chiu T. Anti-spam appliances are better than software. NetworkWorldFusion, http://www.nwfusion.com/columnists/2004/0301faceoffyes.html; 1 March 2004.
Clark J, Koprinska I, Poon J. A neural network based approach to automated e-mail classification. In: Proceedings of IEEE/WIC international conference on web intelligence, 2003. WI 2003; 13–17 Oct 2003. p. 702–5.
Cormack G, Lynam T. A study of supervised spam detection applied to eight months of personal e-mail, http://plg.uwaterloo.ca/~gvcormac/spamcormack.html; 1 July 2004.
Cormack G, Lynam T. Spam corpus creation for TREC; 2005a.
Cormack G, Lynam T. TREC 2005 spam track overview; 2005b.
Cunningham P, Nowlan N, Delany S, Haahr M. A case-based approach to spam filtering that can track concept drift. In: ICCBR'03 workshop on long-lived CBR systems; June 2003.
Daelemans W, Zavrel J, van der Sloot K, van den Bosch A. TiMBL: Tilburg memory based learner, version 3.0, reference guide, ILK, computational linguistics. Tilburg University, http://ilk.kub.nl/~ilk/papers; 2000.
Damiani E, De Capitani di Vimercati S, Paraboschi S, Samarati P. P2P-based collaborative spam detection and filtering. In: P2P '04: proceedings of the fourth international conference on peer-to-peer computing (P2P '04). IEEE Computer Society; 2004a. p. 176–83.
Damiani E, De Capitani di Vimercati S, Paraboschi S, Samarati P. Using digests to identify spam messages. Technical report. University of Milan; 2004b.
Drucker H, Wu D, Vapnik VN. Support vector machines for spam categorization. IEEE Transactions on Neural Networks Sep. 1999;10(5):1048–54.
Espiner T. Demand for anti-spam products to increase. UK: ZDNet; Jun 2005.
Garcia FD, Hoepman J-H, van Nieuwenhuizen J. Spam filter analysis. In: Proceedings of 19th IFIP international information security conference, WCC2004-SEC, Toulouse, France. Kluwer Academic Publishers; Aug 2004.
Golbeck J, Hendler J. Reputation network analysis for email filtering. In: Conference on email and anti-spam; 2004.
Gomes LH, Cazita C, Almeida J, Almeida V, Meira Jr W. Characterizing a spam traffic. In: IMC '04: proceedings of the fourth ACM SIGCOMM conference on Internet measurement. ACM Press; 2004. p. 356–69.
Graham P. A plan for spam, http://paulgraham.com/spam.html; Aug 2002.
Graham P. Better Bayesian filtering. In: Proceedings of the 2003 spam conference; January 2003.
Graham-Cumming J. The spammers' compendium, http://www.jgc.org/tsc/index.htm; Feb 2005.
Gray A, Haahr M. Personalised, collaborative spam filtering. In: Conference on email and anti-spam; 2004.
Hunt R, Cournane A. An analysis of the tools used for the generation and prevention of spam. Computers & Security 2004;23(2):154–66.
Ioannidis J. Fighting spam by encapsulating policy in email addresses. In: Network and distributed system security symposium; 6–7 Feb 2003.
Jennings R. The global economic impact of spam, 2005 report. Technical report. Ferris Research; 2005.
Zeller Jr T. Law barring junk e-mail allows a flood instead. The New York Times 1 Feb 2005.
Lee H, Ng A. Spam deobfuscation using a hidden Markov model; 2005.
Leiba B, Ossher J, Rajan V, Segal R, Wegman M. SMTP path analysis; 2005.
Levine J. Experiences with greylisting; 2005.
Ludlow M. Just 150 'spammers' blamed for e-mail woe. The Sunday Times 1 December 2002.
Mail Abuse Prevention Systems. Definition of spam, http://www.mail-abuse.com/spam_def.html; 2004.
María J, Hidalgo G. Evaluating cost-sensitive unsolicited bulk email categorization. In: SAC '02: proceedings of the 2002 ACM symposium on applied computing. ACM Press; 2002. p. 615–20.
Nie N, Simpser A, Stepanikova I, Zheng L. Ten years after the birth of the Internet, how do Americans use the Internet in their daily lives? Technical report. Stanford University; 2004.
Nutter R. Software or appliance solution? NetworkWorldFusion, http://www.nwfusion.com/columnists/2004/0301nutter.html; 1 March 2004.
O'Brien C, Vogel C. Spam filters: Bayes vs. chi-squared; letters vs. words. In: ISICT '03: proceedings of the first international symposium on information and communication technologies. Dublin: Trinity College; 2003.
Pantel P, Lin D. Spamcop – a spam classification & organisation program. In: Learning for text categorization: papers from the 1998 workshop, Madison, Wisconsin. AAAI technical report WS-98-05; 1998.
Pelletier L, Almhana J, Choulakian V. Adaptive filtering of spam. In: Second annual conference on communication networks and services research; 19–21 May 2004. p. 218–24.
Postini Inc. Postini perimeter manager makes encrypted mail easy and painless. Postini Inc., http://www.postini.com/brochures; 2004.
Process Software. Explanation of common spam filtering techniques (white paper). Process Software, http://www.process.com; 2004.
Radicati Group. Anti-spam 2004 executive summary. Technical report. Radicati Group; 2004.
Rigoutsos I, Huynh T. Chung-kwei: a pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam). In: Conference on email and anti-spam; 2004.
Rios G, Zha H. Exploring support vector machines and random forests for spam detection. In: Conference on email and anti-spam; 2004.
Sahami M, Dumais S, Heckerman D, Horvitz E. A Bayesian approach to filtering junk e-mail. In: Learning for text categorization: papers from the 1998 workshop, Madison, Wisconsin. AAAI technical report WS-98-05; 1998.
Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos C, Stamatopoulos P. A memory-based approach to anti-spam filtering. Technical report DEMO 2001; 2001a.
Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P. Stacking classifiers for anti-spam filtering of e-mail. In: Empirical methods in natural language processing; 2001b. p. 44–50.
Schneider K. Anti-spam appliances are not better than software. NetworkWorldFusion, http://www.nwfusion.com/columnists/2004/0301faceoffno.html; 1 March 2004.
Siefkes C, Assis F, Chhabra S, Yerazunis W. Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In: Proceedings of ECML/PKDD 2004. LNCS. Springer Verlag; 2004.
Snyder J. Spam in the wild, the sequel, http://www.nwfusion.com/reviews/2004/122004spampkg.html; Dec 2004.
Spira J. Spam e-mail and its impact on IT spending and productivity. Technical report. Basex Inc.; 2003.
Vaughan-Nichols S. Saving private e-mail. IEEE Spectrum Aug 2003;40–4.
Wagner M. Study: e-mail viruses up, spam down, http://www.internetweek.com/story/INW20021109S0002; 9 Nov 2002.
Woitaszek M, Shaaban M, Czernikowski R. Identifying junk electronic email in Microsoft Outlook with a support vector machine. In: 2003 symposium on applications and the Internet; 27–31 Jan 2003. p. 166–9.
Yerazunis W. Sparse binary polynomial hashing and the CRM114 discriminator. In: MIT spam conference; 2003.
Yoshida K, Adachi F, Washio T, Motoda H, Homma T, Nakashima A, Fujikawa H, Yamazaki K. Density-based spam detector. In: KDD '04: proceedings of the 2004 ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press; 2004. p. 486–93.
Zdziarski J. Bayesian noise reduction: contextual symmetry logic utilizing pattern consistency analysis, http://www.nuclearelephant.com/papers/bnr.html; 2004.
James Carpinter completed his honours degree in computer science at the University of Canterbury in 2005 and now works as a software engineer for Unisys New Zealand. His research interests include machine learning and internet security. Ray Hunt is an Associate Professor specialising in Networks and Security. He is a member of the Department of Computer Science and Software Engineering at the University of Canterbury, New Zealand.
computers & security 25 (2006) 579–588
Expected benefits of information security investments

Julie J.C.H. Ryan a,*, Daniel J. Ryan b

a Department of Engineering Management and System Engineering, The George Washington University, Washington, DC 20052, USA
b Information Resources Management College, National Defense University, Washington, DC 20319, USA

* Corresponding author. E-mail addresses: [email protected] (J.J.C.H. Ryan), [email protected] (D.J. Ryan).

article info

Article history:
Received 27 January 2005
Revised 11 June 2006
Accepted 3 August 2006

Keywords:
Security
Information security
Attack probabilities
Return-on-investment
Benefits of security investments

abstract

Ideally, decisions concerning investments of scarce resources in new or additional procedures and technologies that are expected to enhance information security will be informed by quantitative analyses. But security is notoriously hard to quantify, since absence of activity challenges us to establish whether lack of successful attacks is the result of good security or merely due to good luck. However, viewing security as the inverse of risk enables us to use computations of expected loss to develop a quantitative approach to measuring gains in security by measuring decreases in risk. In using such an approach, making decisions concerning investments in information security requires calculation of net benefits expected to result from the investment. Unfortunately, little data are available upon which to base an estimate of the probabilities required for developing the expected losses. This paper develops a mathematical approach to risk management based on Kaplan–Meier and Nelson–Aalen non-parametric estimators of the probability distributions needed for using the resulting quantitative risk management tools. Differences between the integrals of these estimators evaluated for enhanced and control groups of systems in an information infrastructure provide a metric for measuring increased security. When combined with an appropriate value function, the expected losses can be calculated and investments evaluated quantitatively in terms of actual enhancements to security.

© 2006 Elsevier Ltd. All rights reserved.
1. Introduction
Making decisions concerning investments in information security requires calculation of net benefits expected to result from the investment. Gordon and Loeb (2002) suggest that expected loss provides a useful metric for evaluating whether an investment in information security is warranted. They propose that since expected loss is the product of the loss v that would be realized following a successful attack on the systems comprising our information infrastructure and the probability that such a loss will occur, one way of accomplishing such calculations is to consider for an investment i the probabilities p_0 and p_i of the losses occurring without and with the investment, respectively. The expected net benefit of the investment i is, then,

ENB[i] = p_0 v - (p_i v + i) = (p_0 - p_i)v - i.   (1)
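For illustration only (these numbers are invented, not drawn from Gordon and Loeb): if p_0 = 0.30, p_i = 0.10, v = $500,000 and i = $60,000, then ENB[i] = (0.30 - 0.10) x $500,000 - $60,000 = $100,000 - $60,000 = $40,000 > 0, so the investment would be attractive under this metric.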
A positive expected net benefit characterizes an attractive investment opportunity. In addition to being subject to a number of simplifying assumptions, such as the loss being constant rather than a function of time, the ability to use this equation depends upon the ability to obtain probability distributions for information security failures. The notion of an information security "failure" is itself a concept that requires careful consideration. "Failure" may not necessarily mean the catastrophic destruction of information assets or systems. Failure in this context means some real or potential compromise of confidentiality, integrity or availability. An asset's confidentiality can be compromised by illicit access even if the integrity and availability of the
asset are preserved. Assets may have their integrity compromised even if their confidentiality and availability are unchanged. Obviously, destruction of an asset compromises its availability, even when confidentiality and integrity may be inviolate. Degradation of performance can be a form of failure, even when the system continues to operate correctly, albeit slowly. Successful installation of malicious code can be a form of information security failure even when the malicious code has not yet affected system performance or compromised the confidentiality, integrity or availability of information assets. Consequently, detection of failures for use in the models discussed in this paper requires that, as part of our experimental design, we carefully define what types of failures are to be considered. Nevertheless, it is only by examining failures that we can begin to understand the actual security states of our information infrastructures. Counting vulnerabilities we have patched or numbers of countermeasures implemented may provide evidence to indirectly reassure us that we are making progress in protecting our valuable information assets, but only metrics that measure failure rates can inform us how well we are actually abating risk. Security is the inverse of risk, and risk is measured by expected loss. Unfortunately, little data are available upon which to base an estimate of the probabilities of failure that are required for expected loss calculations (Ryan and Jefferson, 2003). Mariana Gerber and Rossouw von Solms (2001), in describing the use of quantitative techniques for calculation of annual loss expectancies, say dryly, "The only factor that was somewhat subjective was determining the likelihood of a threat manifesting." Other authors explicitly or implicitly assume the availability of these probability distributions in models they propose to apply to risk management, cost-benefit or return-on-investment decisions (e.g., see Carroll, 1995; Ozier, 1999; Wei et al., 2001; Iheagwara, 2004; Cavusoglu et al., 2004, among others). John Leach (2003) says, "The data is there to be gathered but it seems we are not in the practice of gathering it. We haven't started to gather it because we haven't yet articulated clearly what questions we want the gathered data to answer." This paper will develop a mathematical approach to risk management that will clarify the data that need to be collected, and will explore methods for non-parametric estimation of the probability distributions needed for using the resulting quantitative risk management tools.
2. Failure time distributions

The mathematics of failure time distributions is explored thoroughly in several texts. (See Kalbfleisch and Prentice, 2002; Collett, 2003; Bedford and Cooke, 2001; Crowder, 2001; Therneau and Grambsch, 2000 for excellent coverage of the field.) In the case of information security, the exploration of time to failure is interesting in that it provides a contextual basis for investment decisions that is easily understood by managers and financial officers. Showing an economic benefit over a given period of time can be used to help them to understand investment strategies that take into account capital expenditures, risk mitigation, and residual risk in an operational environment.

If T is a non-negative random variable representing times of failure of individuals in a homogenous population, T can be specified using survivor functions, failure functions, and hazard functions. Both discrete and continuous distributions arise in the study of failure data. The survivor function is defined for discrete and continuous distributions by

S(t) = \Pr(T \ge t), \qquad 0 < t < \infty.   (2)

That is, S(t) is the probability that T exceeds a value t in its range. S(t) is closely related to several other functions that will prove useful in risk assessment and management, including the failure function F(t), which determines the cumulative probability of failure, and its associated probability density f(t), and the hazard function h(t), which provides the instantaneous rate of failure, and its cumulative function H(t). The way in which these functions relate is determined as follows. The failure function or cumulative distribution function associated with S(t) is

F(t) = \Pr(T < t) = 1 - S(t).   (3)

S(t) is a non-increasing right continuous function of t with S(0) = 1 and \lim_{t \to \infty} S(t) = 0. The probability density function f(t) of T is

f(t) = \frac{dF(t)}{dt} = \frac{d[1 - S(t)]}{dt} = -\frac{dS(t)}{dt}.   (4)

Now f(t) gives the density of the probability at t, and so,

f(t)\delta \approx \Pr(t \le T < t + \delta) = S(t) - S(t + \delta)   (5)

for small values of \delta, providing that f(t) is continuous at t. Also, f(t) \ge 0,

\int_0^\infty f(t)\,dt = 1, \qquad \text{and} \qquad S(t) = \int_t^\infty f(s)\,ds.   (6)

The hazard function of T is defined as

h(t) = \lim_{\delta \to 0^+} \Pr(t \le T < t + \delta \mid T \ge t)/\delta.   (7)

The hazard is the instantaneous rate of failure at t of individuals that have survived up until time t. From Eq. (7) and the definition of f(t),

h(t) = \frac{f(t)}{S(t)}.   (8)

Integrating with respect to t,

S(t) = \exp\left[-\int_0^t h(s)\,ds\right] = \exp[-H(t)],   (9)

where H(t) = \int_0^t h(s)\,ds = -\log S(t) is the cumulative hazard function. Then

f(t) = h(t)\exp[-H(t)].   (10)

If T is a discrete random variable and takes on values a_1 < a_2 < \cdots, its probability function is

f(a_j) = \Pr(T = a_j), \qquad j = 1, 2, \ldots,   (11)

and the survivor function is

S(t) = \sum_{j \mid a_j \ge t} f(a_j).   (12)

The hazard function at a_i is the conditional probability of failure at a_i provided that the individual has survived to a_i:

h_i = \Pr(T = a_i \mid T \ge a_i) = \frac{f(a_i)}{S(a_i^-)}, \qquad \text{where } S(a_i^-) = \lim_{t \to a_i^-} S(t).   (13)

Then the survivor function is

S(t) = \prod_{j \mid a_j < t} (1 - h_j),   (14)

and the probability density function is

f(a_i) = h_i \prod_{j=1}^{i-1} (1 - h_j).   (15)
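As a quick concreteness check (not from the paper), the relationships in Eqs. (8)-(10) can be verified numerically for the exponential distribution, whose hazard is constant:

    import math

    # For the exponential distribution with rate lam: h(t) = lam (constant),
    # H(t) = lam * t, S(t) = exp(-H(t)), f(t) = h(t) * exp(-H(t)) -- cf. Eqs (8)-(10).
    lam, t = 0.5, 2.0
    H = lam * t
    S = math.exp(-H)
    f = lam * math.exp(-H)
    assert abs(f / S - lam) < 1e-12   # Eq. (8): h(t) = f(t) / S(t)
    assert abs(S * lam - f) < 1e-12   # Eq. (10): f(t) = h(t) exp(-H(t))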
3. Empirical estimates of survivor and failure distributions

Failure data are not generally symmetrically distributed, usually tending to be positively skewed with a longer tail to the right of the peak of observed failures. Hiromitsu Kumamoto and Ernest J. Henley (1996) show that if the data were not censored, we could use an empirical distribution function to model survivor, failure and hazard distributions. Let N be the number of individual systems in the study, and let n(t) be the number of failures occurring prior to time t. Then, n(t + \delta) - n(t) is the number of systems that can be expected to fail during the time interval [t, t + \delta), and N - n(t) is the number of systems still operational at time t. An empirical estimator \hat{S}(t) for the survivor function S(t) is given by

\hat{S}(t) = \frac{\text{Number of individuals operational at time } t}{\text{Number of individual systems in the study}} = \frac{N - n(t)}{N}.   (16)

So, an empirical estimator \hat{F}(t) for F(t) is

\hat{F}(t) = 1 - \hat{S}(t) = \frac{n(t)}{N}.   (17)

Then, we see that

\hat{f}(t) = \frac{d\hat{F}(t)}{dt} \approx \frac{\hat{F}(t + \delta) - \hat{F}(t)}{\delta} = \frac{n(t + \delta) - n(t)}{\delta N}.   (18)

For sufficiently small \delta, the empirical estimator \hat{h}(t) for the instantaneous failure rate (hazard function) h(t) should approximately satisfy

\hat{h}(t)\delta = \frac{\text{Number of failures during } [t, t + \delta)}{\text{Number of systems still operational at time } t} = \frac{n(t + \delta) - n(t)}{N - n(t)} = \frac{n(t + \delta) - n(t)}{\delta N} \cdot \frac{\delta N}{N - n(t)} = \frac{\hat{f}(t)\delta}{\hat{S}(t)}.

So, \hat{h}(t) = \hat{f}(t)/\hat{S}(t), as we would expect from Eq. (8).

Now, consider the notional system failure data given in Table 1. The entries in the table are the times, in days following initiation of operations, at which the computer systems we have been tracking were observed to fail due to attacks on information assets they are creating, storing, processing or communicating. Thus, two systems fail in the first day, another fails on the 19th day, and so on until the final failure on the 126th day. Graphs of the empirical survivor and empirical failure functions are shown in Fig. 1.

The data in Table 1 consist of failure times due to successful attacks on systems being studied. Usually, however, some of the systems entered into our studies either survive beyond the time allotted for the study and are still going strong when we cease to track them, or they fail during the study period for reasons unrelated to security – perhaps due to reliability problems not resulting from successful attacks, or due to human errors rather than attacks. In either case, studies that collect failure time data usually have such cases, and the systems that survive or that fail for reasons other than those related to the purpose of the study are said to be "right censored." Other types of censoring can occur as well (see Collett, 2003, pp. 1–3; Kalbfleisch and Prentice, 2002, pp. 52–54). Unfortunately, the presence of censored data makes use of the empirical estimators impossible, because the definitions of the functions do not allow information provided by a system for which survival time is censored prior to time t to be used in calculating the functions at t. Fortunately, other estimators have been developed that can take advantage of censored data and provide valid and useful estimations of the functions we need in determining the advantage of a proposed investment.
Table 1 – Notional computer system failure data (values represent days; read down the columns)

0.5    27.5   33     36     38.5   41     44.5   47     51     57
0.5    28.5   33     36     39.5   42     44.5   48     51     57
19.5   29     33.5   37     39.5   42     44.5   48.5   52     59.5
20     29     33.5   37     39.5   42.5   45     48.5   52     61
23.5   29.5   34     37     40     42.5   45     48.5   53     100
23.5   30     34.5   37.5   40     43     46     48.5   53     102
25.5   30.5   34.5   37.5   40.5   43     46     49     53.5   109
25.5   30.5   34.5   38.5   40.5   43     46     50     54.5   121
27.5   31.5   35.5   38.5   40.5   43     46.5   50     55     121
27.5   33     35.5   38.5   41     44.5   46.5   50     56     126
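A minimal sketch (ours, not the authors') of the empirical estimator of Eq. (16), applied to data like Table 1:

    def empirical_survivor(failure_times, t):
        # S_hat(t) = (N - n(t)) / N, where n(t) counts failures strictly
        # before time t -- cf. Eq. (16).
        N = len(failure_times)
        n_t = sum(1 for x in failure_times if x < t)
        return (N - n_t) / N

    # With the 100 Table 1 values loaded into a list `times`,
    # empirical_survivor(times, 40) gives the fraction of systems
    # still operational at the start of day 40.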
Fig. 1 – Empirical survival and failure functions for system failure data. (Plot of S(t) and F(t) against days.)
4. Estimates of survival and failure distributions with censoring

The 100 entries in Table 2 are notional times at which computer systems we have been tracking were observed to fail due to attacks on information assets they are creating, storing, processing or communicating. Thus, two systems fail on the first day, another fails on the 19th day, and so forth until six systems remain operational when the experiment is terminated on the 100th day. The six surviving systems are right censored. Systems that fail on the 27th, 28th, and the other entries marked as negative times, are also censored, representing failures due to causes unrelated to successful attacks such as, perhaps, reliability failures. We will explore two estimators that provide useful representations of the probability distributions underlying such data.
5. Kaplan–Meier estimators

The estimator most commonly used was proposed by Kaplan and Meier (1958), although the basic approach was known and used earlier. Let t_0 represent the start of our study, and let t_1 < t_2 < \cdots < t_k be the distinct ordered failure times observed before the study ends at time t*. Let n_j be the number of systems still at risk just prior to t_j, and let d_j be the number of failures at t_j. The Kaplan–Meier estimator of the survivor function is then

S_{KM}(t) = \prod_{j \mid t_j \le t} \frac{n_j - d_j}{n_j}.   (20)

Note that if there are no censored failure times in the data set, then n_j - d_j = n_{j+1} in Eq. (20), and

S_{KM}(t) = \prod_{i=1}^{k} \frac{n_i - d_i}{n_i} = \prod_{i=1}^{k} \frac{n_{i+1}}{n_i} = \frac{n_{k+1}}{n_1}.   (21)

Since n_1 = N, and n_{k+1} is the number of systems that survive beyond t_{k+1}, the Kaplan–Meier estimate of the survivor function is identical to the empirical estimator of the survival in the absence of censored data.
Table 2 – Notional computer system failure data (the − sign indicates censoring; values represent days; read down the columns)

0.5    27.5   33     36     38.5   −41    44.5   47     51     57
0.5    28.5   33     −36    39.5   42     44.5   48     −51    57
19.5   −29    −33.5  37     39.5   42     44.5   48.5   52     59.5
20     29     33.5   37     39.5   42.5   −45    48.5   52     61
23.5   29.5   34     37     40     42.5   45     48.5   −53    100
23.5   30     34.5   37.5   40     43     46     48.5   53     100
25.5   30.5   34.5   37.5   40.5   43     46     49     53.5   100
25.5   30.5   34.5   38.5   40.5   43     46     50     54.5   100
−27.5  31.5   35.5   38.5   40.5   43     46.5   50     55     100
27.5   33     35.5   38.5   41     44.5   46.5   50     56     100
The estimator for the failure function that corresponds to Eq. (20) is

F_{KM}(t) = 1 - S_{KM}(t).   (22)

It is natural to estimate the hazard function by the ratio of the number of systems that fail at a given time divided by the number at risk at that time. More formally, because H(t) = -\log S(t), we have

H_{KM}(t) = -\log S_{KM}(t) = -\sum_{i=1}^{j} \log \frac{n_i - d_i}{n_i}, \qquad t_j \le t < t_{j+1},   (23)

and

h_{KM}(t) \approx \frac{H_{KM}(t_j) - H_{KM}(t_{j-1})}{t_{j+1} - t_j} \approx \frac{d_j}{n_j (t_{j+1} - t_j)}, \qquad t_j \le t < t_{j+1}.   (24)

This equation can be applied to all intervals except the interval that begins at t_k, since that interval is infinite in length. Returning our attention to the data in Table 2, we set out the calculations in Table 3 for the Kaplan–Meier estimators of the survivor, failure and hazard functions. Fig. 2 shows the Kaplan–Meier estimator for the survivor function.
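The product-limit computation of Eq. (20) is small enough to sketch directly; the following is an illustrative implementation, not the authors' code:

    def kaplan_meier(failures, censored):
        # failures: observed failure times; censored: right-censored times.
        # Returns [(t_j, S_KM(t_j))], with S stepping down at each failure
        # time -- cf. Eq. (20).
        times = sorted(failures + censored)
        s, curve = 1.0, []
        for t in sorted(set(failures)):
            at_risk = sum(1 for x in times if x >= t)   # n_j: at risk just before t_j
            d = failures.count(t)                       # d_j: failures at t_j
            s *= (at_risk - d) / at_risk
            curve.append((t, s))
        return curve

Feeding it the Table 2 data (failure times as positive values, censored observations in the second list) should reproduce the S_KM column of Table 3.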
Table 3 – Kaplan–Meier calculations (each labelled row below lists that column's values for j = 0, 1, …, 47)

j: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
t_j: 0 0.5 19.5 20 23.5 25.5 27.5 28.5 29 29.5 30 30.5 31.5 33 33.5 34 34.5 35.5 36 37 37.5 38.5 39.5 40 40.5 41 42 42.5 43 44.5 45 46 46.5 47 48 48.5 49 50 51 52 53 53.5 54.5 55 56 57 59.5 61
t_{j+1} − t_j: 0.5 19 0.5 3.5 2 2 1 0.5 0.5 0.5 0.5 1 1.5 0.5 0.5 0.5 1 0.5 1 0.5 1 1 0.5 0.5 0.5 1 0.5 0.5 1.5 0.5 1 0.5 0.5 1 0.5 0.5 1 1 1 1 0.5 1 0.5 1 1 2.5 1.5
d_j: 0 2 1 1 2 2 2 1 1 1 1 2 1 3 1 1 3 2 1 3 2 4 3 2 3 1 2 2 4 4 1 3 2 1 1 4 1 3 1 2 1 1 1 1 1 2 1 1
c_j: 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
(n_j − d_j)/n_j: 1 0.9800 0.9897 0.9896 0.9789 0.9785 0.9780 0.9886 0.9885 0.9882 0.9881 0.9759 0.9877 0.9615 0.9867 0.9863 0.9583 0.9706 0.9848 0.9524 0.9667 0.9310 0.9444 0.9608 0.9388 0.9783 0.9545 0.9524 0.9000 0.8857 0.9677 0.8966 0.9231 0.9583 0.9565 0.8182 0.9444 0.8235 0.9286 0.8333 0.9000 0.8750 0.8571 0.8333 0.8000 0.5000 0.5000 0.0000
S_KM(t): 1 0.9800 0.9699 0.9598 0.9396 0.9194 0.8992 0.8890 0.8787 0.8684 0.8581 0.8374 0.8270 0.7952 0.7846 0.7739 0.7416 0.7198 0.7089 0.6752 0.6527 0.6076 0.5739 0.5514 0.5176 0.5064 0.4834 0.4603 0.4143 0.3670 0.3551 0.3184 0.2939 0.2816 0.2694 0.2204 0.2082 0.1714 0.1592 0.1327 0.1194 0.1045 0.0895 0.0746 0.0597 0.0298 0.0149 0.0000
F_KM(t): 0 0.0200 0.0301 0.0402 0.0604 0.0806 0.1008 0.1110 0.1213 0.1316 0.1419 0.1626 0.1730 0.2048 0.2154 0.2261 0.2584 0.2802 0.2911 0.3248 0.3473 0.3924 0.4261 0.4486 0.4824 0.4936 0.5166 0.5397 0.5857 0.6330 0.6449 0.6816 0.7061 0.7184 0.7306 0.7796 0.7918 0.8286 0.8408 0.8673 0.8806 0.8955 0.9105 0.9254 0.9403 0.9702 0.9851 1.0000
h_KM(t): 0.0000 0.0011 0.0206 0.0030 0.0106 0.0109 0.0225 0.0227 0.0233 0.0235 0.0238 0.0244 0.0082 0.0769 0.0263 0.0267 0.0417 0.0571 0.0147 0.0923 0.0317 0.0678 0.1071 0.0741 0.1176 0.0204 0.0851 0.0889 0.0650 0.2162 0.0286 0.1875 0.1333 0.0345 0.0714 0.3333 0.0435 0.1500 0.0556 0.1250 0.1429 0.0769 0.1667 0.0909 0.1000 0.1000 0.0952 0.0000
f_KM(t): 0.0000 0.0011 0.0200 0.0029 0.0100 0.0100 0.0202 0.0202 0.0205 0.0204 0.0204 0.0204 0.0068 0.0612 0.0206 0.0207 0.0309 0.0411 0.0104 0.0623 0.0207 0.0412 0.0615 0.0409 0.0609 0.0103 0.0411 0.0409 0.0269 0.0793 0.0102 0.0597 0.0392 0.0097 0.0192 0.0735 0.0091 0.0257 0.0089 0.0166 0.0171 0.0080 0.0149 0.0068 0.0060 0.0030 0.0014 0.0000
Fig. 2 – The Kaplan–Meier estimator for the survivor function underlying the data in Table 2. (Plot of S(t) against time in days.)
6. Nelson–Aalen estimators
The Nelson–Aalen estimator was first proposed by Nelson (1969, 1972). Altshuler (1970) also derived the estimator. It has been shown that the Nelson–Aalen estimate of the survivor function is always greater than the Kaplan–Meier estimate at a specified time. For small samples, the Nelson–Aalen estimator is better than the Kaplan–Meier estimator (Collett, 2003, p. 22). The Nelson–Aalen estimator is given by

S_{NA}(t) = \prod_{j=1}^{k} \exp(-d_j/n_j).   (25)

Since \exp(-d_j/n_j) \approx 1 - (d_j/n_j) = (n_j - d_j)/n_j whenever d_j is small compared to n_j, which it is except toward the end of the study, the Kaplan–Meier estimate given by Eq. (20) closely approximates the Nelson–Aalen estimate in Eq. (25). Then, as usual, we can obtain the Nelson–Aalen failure function directly from the survivor estimator:

F_{NA}(t) = 1 - S_{NA}(t).   (26)
The cumulative hazard H(t) at time t is, by definition, the integral of the hazard function. Because H(t) = -\log S(t), we find

H_{NA}(t) = -\log S_{NA}(t) = \sum_{j=1}^{r} \frac{d_j}{n_j},   (27)

the cumulative sum of estimated probabilities of failure in the first r time intervals. The differences between adjacent values of H_NA(t), divided by the time interval, are estimates of the hazard function h_NA(t) (Collett, 2003, p. 33). Thus,

h_{NA}(t) \approx \frac{d_j}{n_j (t_{j+1} - t_j)},   (28)

exactly as in Eq. (24). The calculations for the Nelson–Aalen estimators for the data from Table 2 are shown in Table 4, and a graph of the survivor function is shown in Fig. 3. To obtain the probability density functions associated with the Kaplan–Meier and Nelson–Aalen estimators, we can use Eq. (8), or Eq. (16).

Fig. 3 – The Nelson–Aalen estimator for the survivor function underlying the data in Table 2. (Plot of S(t) against time in days.)

7. The advantage of an investment in information security

We hope and expect, of course, that an investment in information security will provide us with some advantages. If it did not, we would be foolish to make the investment. More specifically, we expect that an investment in information security will result in greater freedom from successful attacks on the systems being protected, so that the systems survive longer before succumbing to an attack. Since no security is perfect, we know that eventually the systems will succumb, but our investment should delay that time. The result of an investment should, then, be to move the survivor curve to the right. Suppose that we track 100 systems which are protected by an investment in additional information security that mirrors the investment proposed for our information infrastructure. This might occur contemporaneously with the gathering of the data in Table 2, so that the systems represented in Table 2 represent a control group. Thus, in the enhanced group, one system fails on the first day, another fails on the fifth day, and so forth until 34 systems remain operational when the experiment is terminated on the 100th day. The 34 surviving systems are right-censored. Systems that fail on the 65th, 93rd, and the other entries marked as negative times, are also censored, representing failures due to causes unrelated to successful attacks such as, perhaps, reliability failures. Tracking the enhanced systems might also take place later, and could even be the same systems as were followed in preparing Table 2, suitably repaired following the attacks that provided the data in Table 2 and enhanced according to the proposed investment. In any event, suppose that following protection of the new set of systems with the improved security, we observe the attacks described by the failure times in Table 5. Fig. 4 shows the survivor curve S_0 that we experience without the investment (from Table 2), and the survivor curve S_i that occurs following the investment (from Table 5). The benefit produced by our investment is the area between the curves. Thus, the advantage we gain from our investment i is given by

A(i) = \int_0^\infty S_i(t)\,dt - \int_0^\infty S_0(t)\,dt = E_i[T] - E_0[T],   (29)

since, as is well known,

E[T] = \int_0^\infty t f(t)\,dt = \int_0^\infty S(t)\,dt.   (30)

Of course, it may be that S_i(t) \ge S_0(t) is not always true. The two curves may cross one or more times, but if A(i) > 0 the benefits of the investment will eventually outweigh any short term detriments. If, however, we restrict our attention to the near term, say one year or a few years, it is possible that A(i) > 0 but the value of \int_0^t S_i(s)\,ds - \int_0^t S_0(s)\,ds is less than zero for t restricted to the period of interest. Eq. (29) is a useful metric for measuring the advantage of an investment, but it still fails to address the impact of a successful attack on those information assets that have their confidentiality, integrity or availability compromised at the observed failure times.
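For the step-function survivor curves produced by these estimators, Eq. (29) reduces to summing rectangle areas; a sketch restricted to a finite horizon t_max (an assumption needed in place of the improper integrals) follows:

    def area_under_step_curve(curve, t_max):
        # curve: [(t_j, S(t_j))] from a survivor estimator; integrates the
        # step function from 0 to t_max, with S = 1 before the first failure.
        area, prev_t, prev_s = 0.0, 0.0, 1.0
        for t, s in curve:
            if t > t_max:
                break
            area += prev_s * (t - prev_t)
            prev_t, prev_s = t, s
        area += prev_s * (t_max - prev_t)
        return area

    def advantage(curve_enhanced, curve_control, t_max):
        # A(i) restricted to [0, t_max] -- cf. Eq. (29); positive values
        # favour the investment over the chosen horizon.
        return (area_under_step_curve(curve_enhanced, t_max)
                - area_under_step_curve(curve_control, t_max))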
Table 4 – Nelson–Aalen calculations (each labelled row below lists that column's values for j = 0, 1, …, 47)

j: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
t_j: 0 0.5 19.5 20 23.5 25.5 27.5 28.5 29 29.5 30 30.5 31.5 33 33.5 34 34.5 35.5 36 37 37.5 38.5 39.5 40 40.5 41 42 42.5 43 44.5 45 46 46.5 47 48 48.5 49 50 51 52 53 53.5 54.5 55 56 57 59.5 61
d_j: 0 2 1 1 2 2 2 1 1 1 1 2 1 3 1 1 3 2 1 3 2 4 3 2 3 1 2 2 4 4 1 3 2 1 1 4 1 3 1 2 1 1 1 1 1 2 1 1
n_j: 100 98 97 96 94 92 89 88 86 85 84 82 81 78 76 75 72 70 68 65 63 59 56 54 51 49 47 45 41 37 35 32 30 29 28 24 23 20 18 16 14 13 12 11 10 8 7 6
exp(−d_j/n_j): 1.0000 0.9798 0.9897 0.9896 0.9789 0.9785 0.9778 0.9887 0.9884 0.9883 0.9882 0.9759 0.9877 0.9623 0.9869 0.9868 0.9592 0.9718 0.9854 0.9549 0.9688 0.9345 0.9478 0.9636 0.9429 0.9798 0.9583 0.9565 0.9070 0.8975 0.9718 0.9105 0.9355 0.9661 0.9649 0.8465 0.9575 0.8607 0.9460 0.8825 0.9311 0.9260 0.9200 0.9131 0.9048 0.7788 0.8669 0.8465
S_NA(t): 1.0000 0.9798 0.9697 0.9597 0.9395 0.9193 0.8989 0.8887 0.8784 0.8682 0.8579 0.8372 0.8269 0.7957 0.7853 0.7749 0.7433 0.7224 0.7118 0.6797 0.6585 0.6153 0.5832 0.5620 0.5299 0.5192 0.4976 0.4759 0.4317 0.3875 0.3766 0.3429 0.3207 0.3099 0.2990 0.2531 0.2423 0.2086 0.1973 0.1741 0.1621 0.1501 0.1381 0.1261 0.1141 0.0889 0.0770 0.0652
F_NA(t): 0.0000 0.0202 0.0303 0.0403 0.0605 0.0807 0.1011 0.1113 0.1216 0.1318 0.1421 0.1628 0.1731 0.2043 0.2147 0.2251 0.2567 0.2776 0.2882 0.3203 0.3415 0.3847 0.4168 0.4380 0.4701 0.4808 0.5024 0.5241 0.5683 0.6125 0.6234 0.6571 0.6793 0.6901 0.7010 0.7469 0.7577 0.7914 0.8027 0.8259 0.8379 0.8499 0.8619 0.8739 0.8859 0.9111 0.9230 0.9348
h_NA(t): 0.0000 0.0011 0.0206 0.0030 0.0106 0.0109 0.0225 0.0227 0.0233 0.0235 0.0238 0.0244 0.0082 0.0769 0.0263 0.0267 0.0417 0.0571 0.0147 0.0923 0.0317 0.0678 0.1071 0.0741 0.1176 0.0204 0.0851 0.0889 0.0650 0.2162 0.0286 0.1875 0.1333 0.0345 0.0714 0.3333 0.0435 0.1500 0.0556 0.1250 0.1429 0.0769 0.1667 0.0909 0.1000 0.1000 0.0952 0.0000
f_NA(t): 0.0000 0.0011 0.0200 0.0029 0.0100 0.0100 0.0202 0.0202 0.0205 0.0204 0.0204 0.0204 0.0068 0.0612 0.0207 0.0207 0.0310 0.0412 0.0105 0.0627 0.0209 0.0417 0.0625 0.0416 0.0623 0.0106 0.0423 0.0423 0.0281 0.0838 0.0108 0.0643 0.0427 0.0107 0.0213 0.0844 0.0105 0.0313 0.0110 0.0218 0.0232 0.0115 0.0230 0.0115 0.0114 0.0089 0.0073 0.0000
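For comparison with Table 4, the Nelson–Aalen computation of Eqs. (25) and (27) can be sketched the same way (illustrative only, not the authors' code):

    import math

    def nelson_aalen(failures, censored):
        # Returns [(t_j, S_NA(t_j))], with S_NA(t) = exp(-H_NA(t)) and H_NA
        # accumulating d_j / n_j at each failure time -- cf. Eqs (25) and (27).
        times = sorted(failures + censored)
        H, curve = 0.0, []
        for t in sorted(set(failures)):
            n_j = sum(1 for x in times if x >= t)   # at risk just before t_j
            d_j = failures.count(t)                 # failures at t_j
            H += d_j / n_j
            curve.append((t, math.exp(-H)))
        return curve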
We know that the impact of a successful attack on an information asset varies over time. A compromise of the confidentiality of the planned landing on Omaha Beach on D-day would have had enormous impact on June 5, 1944, and a negligible impact on June 7th. The impact of a successful attack on integrity is much larger following creation and prior to making and storing a backup of an information asset than the impact that would result from a compromise after the backup is safely stored. Such a collapse of the loss function is characteristic of information security, although whether the decline takes place very rapidly, as in the D-day example, or degrades gradually over time, depends upon circumstances.

Alternatively, consider the loss functions associated with an organization that keeps customer accounts in a database that is updated once per day, say at midnight, based on a transactions file accumulated throughout the day. The loss function for the database itself is constant throughout the day:

v(t) = \begin{cases} 0, & \text{prior to midnight and the creation of the database} \\ v, & \text{between midnight and the following midnight} \\ 0, & \text{after midnight, when the database has been replaced by an updated database} \end{cases}   (31)

If a duplicate copy of the accounts database is made concurrently with the following midnight's update, and stored securely offsite, the exposure for destruction or corruption of the accounts database can be as little as the cost of recovering and installing the backup – much less than the cost of creating the database from scratch. The loss function for the transaction file, on the other hand, is a sawtooth function, the value of which is zero at midnight and increases monotonically as transactions accumulate throughout the day, but which returns to zero value the following midnight when the transactions are used to update the accounts database and a new transactions file is created for the following day. The loss function v(t) for all the information assets contained in our information infrastructure is, of course, a sum of the individual loss functions for each asset. Our estimation processes have provided us with the probability densities f_0 and f_i we need, so, given a loss function v(t) representing the loss we would experience from a successful attack on our unimproved infrastructure as a function of time, the expected loss at t without the proposed investment is

E_0[v] = \int_0^\infty v(t) f_0(t)\,dt,   (32)

and, similarly, for the expected loss following making the proposed investment in information security,

E_i[v] = \int_0^\infty v(t) f_i(t)\,dt.   (33)

Since we expect the loss to be less following our investment, E_0[v(t)] - E_i[v(t)] > 0, and the expected net benefit is

B(i) = E_0[v(t)] - E_i[v(t)] - i.   (34)

Of course, if v(t) = v is constant, then

E_0[v(t)] - E_i[v(t)] = E_0[v] - E_i[v] = v - v = 0,

so B(i) = -i, confirming our intuition that no security is ever perfect and eventually a compromise will occur. But, if our investment i is such that the survivor curve moves sufficiently far to the right, then the occurrence of a successful attack could be delayed beyond the collapse of the loss function. As S_i(t) moves to the right, so does f_i(t). If t_c is the time at which the loss function collapses, then, since v(t) = 0 for t > t_c,

E_0[v(t)] - E_i[v(t)] = \int_0^{t_c} v(t) f_0(t)\,dt - \int_0^\infty v(t) f_i(t)\,dt = \int_0^{t_c} v(t) f_0(t)\,dt - \left[ \int_0^{t_c} v(t) f_i(t)\,dt + \int_{t_c}^\infty v(t) f_i(t)\,dt \right].   (35)

Then, if f_i has moved to the right, making f_i(t) small when t < t_c, the second integral is small, and

B(i) = E_0[v(t)] - E_i[v(t)] - i \approx \int_0^{t_c} v(t) f_0(t)\,dt - i.   (36)

Thus, the collapse of the loss function will make our investment worthwhile if i is less than the near-term loss expected if the investment is not made.
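Eqs. (32)-(36) are straightforward to approximate numerically once v(t), f_0 and f_i are available; a sketch, with all functions and the time grid supplied by the user, is:

    def expected_loss(loss, density, t_grid):
        # Midpoint Riemann approximation of E[v] = integral of v(t) f(t) dt
        # over the grid -- cf. Eqs (32)-(33).
        total = 0.0
        for a, b in zip(t_grid, t_grid[1:]):
            mid = 0.5 * (a + b)
            total += loss(mid) * density(mid) * (b - a)
        return total

    def expected_net_benefit(loss, f0, fi, t_grid, investment):
        # B(i) = E0[v(t)] - Ei[v(t)] - i -- cf. Eq. (34).
        return expected_loss(loss, f0, t_grid) - expected_loss(loss, fi, t_grid) - investment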
8. Future research
To enable the use of expected loss in these mathematical models for evaluation of proposed investments in information security, and the use of the metrics in tracking the evolution of security in information infrastructures as investments are implemented, more research is needed. Studies of failure time data can involve epidemiological studies of the entire information infrastructure, or can use a cross-sectional study of a representative sub-infrastructure based on concurrent measurements of samples of individual systems followed prospectively (a cohort study), or of several
Table 5 – Notional computer system failure data following an investment i in information security (the − sign indicates censoring; values represent days; read down the columns)

0.5    65.5   75.5   81     87.5   93     98     100    100    100
5.5    67.5   77     82.5   87.5   93     98     100    100    100
26.5   68.5   77     82.5   87.5   93.5   99     100    100    100
57     70.5   78     84.5   89     93.5   99     100    100    100
57.5   72.5   78     84.5   89     94.5   99     100    100    100
60     72.5   79     84.5   89     94.5   99     100    100    100
60     72.5   80     85.5   91     96     100    100    100    100
62     72.5   80     85.5   91.5   96     100    100    100    100
62     72.5   80.5   86     91.5   96     100    100    100    100
65.5   75     81     86     91.5   97.5   100    100    100    100
computers & security 25 (2006) 579–588
Fig. 4 – The survivor function S0(t) before an investment and Si(t) following an investment i in information security. (Plot of probability against time for the two survivor curves.)
samples with retrospective measurements (a case–control study). Alternatively, we can use a randomized controlled trial to study two or more parallel, randomized cohorts of systems, one of which (the control group) receives no security enhancements, and the others of which are enhanced by a proposed investment or investments in information security. Such studies should be undertaken to evaluate the overall utility of various approaches to using failure time data in an information security environment, and should address a wide variety of different practices and technologies to determine if this approach is more effective in some settings or for some types of practices or technologies.
9. Conclusion
Too often information security investment decisions are based on criteria that are at best qualitative, and at worst little more than fear, uncertainty and doubt derived from anecdotal evidence. Security is the inverse of risk. Because security is at its best when nothing happens, it is notoriously difficult to measure. But we can measure risk by calculating expected loss, and a reduction in expected loss is a measure of the change in security posture that accrues to an information infrastructure following an investment in a new security practice or technology. Unfortunately, there are little data available to allow us to understand the probabilities of successful attacks that we need in order to calculate expected loss. We sought a way to obtain the probabilities needed so we can actually use the expected net benefit Eq. (1). By collecting data on the experiences of two separate populations, one which has the benefit of the investment and the other – a control group – which does not, we can compute the Kaplan–Meier or Nelson–Aalen estimates of the probability density functions with and without the investment. Having the probabilities, and knowing the value of our information assets, we can calculate the respective expected losses, and use those expected values to measure the advantage that will be realized from the investment. We will then be able to make an informed decision as to whether a proposed investment is wise. Implemented in an operational environment, this method can change the way security investments are considered and made.

references
Altshuler B. Theory for the measurement of competing risks in animal experiments. Mathematical Bioscience 1970;6:1–11.
Bedford Tim, Cooke Roger. Probabilistic risk analysis: foundations and methods. Cambridge, UK: Cambridge University Press; 2001.
Carroll John M. Information security risk management. In: Hutt Arthur, et al., editors. Computer security handbook. 3rd ed. NY: John Wiley & Sons; 1995. p. 3.1–3.20.
Cavusoglu Huseyin, Mishra Birendra, Raghunathan Srinivasan. A model for evaluating IT security investments. Communications of the ACM 2004;47(7):87–92.
Collett David. Modelling survival data in medical research. 2nd ed. Boca Raton: Chapman & Hall/CRC; 2003.
Crowder Martin. Classical competing risks. Washington, DC: Chapman & Hall/CRC; 2001.
Gordon Lawrence A, Loeb Martin P. The economics of information security investment. ACM Transactions on Information and System Security November 2002;5(4):438–57.
Gerber Mariana, von Solms Rossouw. From risk analysis to security requirements. Computers and Security 2001;20:580.
Iheagwara Charles. The effect of intrusion detection management methods on the return on investment. Computers and Security 2004;23:213–28.
Kalbfleisch John D, Prentice Ross L. The statistical analysis of failure time data. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2002.
Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 1958;53:457–81.
Kumamoto Hiromitsu, Henley Ernest J. Probabilistic risk assessment for engineers and scientists. 2nd ed. New York: IEEE Press; 1996. p. 266ff.
Leach John. Security engineering and security ROI. p. 2, ; 2003 [accessed 7/22/2004].
Nelson Wayne. Hazard plotting for incomplete failure data. Journal of Quality Technology 1969;1:27–52.
Nelson Wayne. Theory and applications of hazard plotting for censored failure data. Technometrics 1972;14:945–65.
Ozier Will. Risk analysis and assessment. In: Tipton Harold F, Krause Micki, editors. Information security management handbook. 4th ed. Boca Raton: Auerbach; 1999. p. 247–85.
Ryan Julie JCH, Jefferson Theresa I. The use, misuse, and abuse of statistics in information security research. Managing technology in a dynamic world. In: Proceedings of the 2003 American Society for Engineering Management conference, St. Louis, Missouri, October 15–18, 2003. p. 644–53.
Therneau Terry M, Grambsch Patricia M. Modeling survival data: extending the Cox model. New York: Springer; 2000.
Wei Huaqiang, Frinke Deb, Carter Olivia, Ritter Chris, 2001. Cost benefit analysis for network intrusion detection systems. In: CSI 28th annual computer security conference, Washington, DC, [accessed 7/22/2004].
Julie JCH Ryan is a member of the faculty at George Washington University in Washington, DC. Earlier she served as President of the Wyndrose Technical Group, Inc., a company providing information technology and security consulting services. She was also a Senior Associate at Booz Allen & Hamilton and a systems analyst for Sterling Software. In the public sector she served as an analyst at the Defense Intelligence Agency, having begun her career as an intelligence officer after graduating from the Air Force Academy. She holds a Masters from Eastern Michigan University and a D.Sc. from George Washington University.
Daniel J. Ryan is a Professor at the Information Resources Management College of the National Defense University in Washington, DC. Prior to joining academia, he served as Executive Assistant to the Director of Central Intelligence, and still earlier as Director of Information Security for the Office of the Secretary of Defense. In the private sector, he served as a Corporate Vice President of SAIC and as a Principal at Booz Allen & Hamilton. He holds a Masters in mathematics from the University of Maryland, an MBA from California State University, and a JD from the University of Maryland.
computers & security 25 (2006) 589–599
A virtual disk environment for providing file system recovery

Jinqian Liang a,*, Xiaohong Guan a,b

a Center for Intelligent and Networked Systems and National Lab for Information Science and Technology, Tsinghua University, Beijing 100084, China
b Ministry of Education Key Lab for Intelligent Networks and Network Security and State Key Laboratory for Manufacturing Systems, Xi'an Jiaotong University, Xi'an 710049, China
article info

Article history: Received 12 April 2005; Revised 25 June 2006; Accepted 3 August 2006

Keywords: File system recovery; Data protection; Block-based; Filter driver; Virtual disk environment; File system

abstract

File system recovery (FSR) is a recovery facility that allows users to roll back the file system state to a previous state. In this paper, we present a virtual disk environment (VDE) which allows previous write operations to a hard disk to be undone, and previous versions of files to be recovered. It can be used to recover the file system quickly even when the computer system suffers a serious disaster such as a system crash or boot failure. The VDE is similar in some ways to a virtual disk in a virtual machine (VM) environment, but it can be applied to environments without VM support. Algorithms for implementing the VDE are presented and its implementation on the Windows platform is discussed. Based on the implementation, the experimental results of the VDE performance are analyzed. Compared with other FSRs, the main advantages of the VDE are low overhead and high recovery speed.

© 2006 Elsevier Ltd. All rights reserved.
1. Introduction
As computer hardware stability improves, more and more computer errors come from software faults. Software faults are often caused by erroneous operations, malicious code attacks, software conflicts, etc. These faults greatly affect computer use. Business interruptions and loss of productivity caused by system failure can be damaging and expensive. Thus, we need a recovery facility to restore the file system in case it becomes corrupted. The issue of undo has been studied by several authors over many years (Archer et al., 1984; Prakash and Knister, 1994;
Washizaki and Fukazawa, 2002). Indeed, the ability to undo operations has become a standard and useful feature in most interactive applications (Kim et al., 1999). For instance, the availability of an undo facility in editors and DBMSs is useful for reversing erroneous actions. Just as undo applies to an application, FSR is a kind of undo operation for the software system as a whole, including the operating system and its applications. FSR can be applied to many fields that need to recover the file system state to a specific state. It can also help reduce user frustration with new systems and encourage users to experiment. FSR can be classified into two categories: static FSR and dynamic FSR. The static FSR provides protection for a file
The research presented in this paper is supported in part by the National Outstanding Young Investigator Grant (6970025), National Natural Science Foundation (60243001, 60574087) and 863 High Tech Development Plan (2003AA142060) of China. * Corresponding author. E-mail addresses: [email protected] (J. Liang), [email protected] (X. Guan).
system through full or incremental backup at intervals, and restores the file system from these backup data when the user wants to recover it. However, the static FSR cannot provide real-time protection for a file system, and it generally consumes significant capacity on the backup medium and takes more time to restore. The well-known Norton Ghost (Symantec) is an example of static FSR. The dynamic FSR has more advantages than the static FSR, although it is more complex: it not only provides real-time protection for the file system but also takes less space than the static FSR. In this paper, we consider only the dynamic FSR; in the following presentation, FSR means the dynamic FSR unless explicitly stated otherwise.

FSR, as a kind of undo operation on a computer system, often involves backup and restore operations; these operations not only seriously affect system performance, but also lead to low recovery speed. The VDE, as a dynamic FSR, does not back up disk data while it protects the hard disk state, and thus there is no need to restore the hard disk data when it recovers the disk state. Therefore, the VDE can recover a disk state at very high speed. For example, it takes less than 15 s to recover a hard disk with 60 GB capacity. This feature makes the VDE very suitable for those fields where systems are likely to suffer attacks or destruction and need to recover quickly. Consider the common case where users must have privileged system access in order to run poorly written applications (all too common), but this creates support problems as some users will inevitably install screen-savers, tweak the OS, load games, etc., generally creating a support nightmare. The VDE would allow the user the freedom to play, yet the machine will be returned to the standard build either at the end of the day or when the support group resets the VDE's index tables. On the privacy front, in the SOHO world, the VDE offers users the potential to be 100% certain that they have deleted all cookies, temporary files and other metadata that build up on machines, and even the assurance that any spyware that has been picked up can be flushed in a simple way.

Despite a large amount of research effort on computer security, there is no such thing as an unbreakable system. An information system may be damaged by malicious attacks or honest human errors, and it is fundamentally difficult either to stop all security breaches or to completely prevent mistakes. So the next best thing one can hope for is to restore the damaged system to its functional state as soon as possible. Thus, instead of aiming at an absolutely secure system with infinite mean time between breaches, one should strive to reduce the mean time to repair the system. Our solution provides a way towards this target. However, the VDE presented in this paper only provides a single recovery point: it restores the file system only to a single known state saved at a pre-specified time, and it cannot roll the file system back to an arbitrary named point in time.

The rest of the paper is organized as follows. Section 2 gives a brief description of related work dealing with file system protection, and shows the advantages that our solution offers over other protection solutions. Section 3 presents an overview of data protection and its characteristics. Section 4 provides the algorithms to implement data protection for
the hard disk. Section 5 describes the implementation of the VDE on the Windows platform. Section 6 analyzes the performance and presents the experimental results from disk micro-benchmarks on a partition, followed by conclusions in Section 7.
2. Related work
FSR is not a new technology, and it has been studied and developed in many fields, such as fault-tolerant parallel systems (Pei et al., 2000), large scale servers and databases (David and Lomet, 2000), etc. In recent years, some cost-effective FSR solutions for the PC world have been developed. Many file systems provide certain levels of FSR; for example, the Windows NTFS file system supports FSR through log-based recovery (Microsoft Corporation). Journaling file systems (Seltzer et al., 2000; Stein et al., 2001; Piernas et al., 2002) are the representative file systems that support FSR. A journaling file system (JFS) is a file system that contains its own backup and recovery capability. Before indexes on disk are updated, information about the changes is recorded in a log. The JFS maintains a log, or journal, of what activity has taken place in the main data areas of the disk. The JFS either commits a change to the log or can roll it back in a transactional manner, much like an RDBMS. Using database journaling techniques, the JFS can restore a file system to a consistent state quickly. If a crash occurs, any lost data can be recreated because updates to the metadata in directories and bit maps have been written to a serial log. The JFS not only returns the data to the pre-crash configuration but also recovers unsaved data and stores it in the location it would have been stored in if the system had not been unexpectedly interrupted. Many operating systems support a JFS, such as Apple's HFS, Linux Ext3, ReiserFS, XFS, and JFS, Solaris UFS, etc. However, the recovery ability of a file system depends on its metadata; if the metadata has been destroyed or the system fails to boot, the file system loses its recovery ability. In fact, viruses or malicious code can destroy a file system easily by writing to the storage media directly, without going through the file system.

The Repairable File System (RFS) project (Zhu et al., 2003) aims at improving the speed and precision of post-intrusion damage repair for NFS servers. RFS maintains file system operation logs and carries out dependency analysis to provide fast and accurate repair of damage caused by NFS operations issued by attackers.

To help avoid unplanned outages, Microsoft introduced the System Restore feature in Windows XP and other platforms (Microsoft Corporation). System Restore is a component of Windows which allows a user to restore a computer to a previous state without losing personal data files (such as Microsoft Word documents, browsing history, drawings, favorites, or email) in case a system problem is encountered. System Restore monitors changes to the system and some application files, and it automatically creates easily identified recovery points. These restore points allow the user to revert the system to a previous time. They are created daily and at the time of significant system events (such as when an application or
driver is installed). Users can also create and name their own restore points at any time. However, if the system becomes unbootable, System Restore will not work anymore, and one must resort to an emergency boot disk to restore the system. Moreover, System Restore is a file-based scheme that provides only partial protection for the file system; it cannot protect against operations which bypass the file system (see the direct-to-disk operations module in Fig. 4(a)). If malicious code destroys the file system by writing to the disk directly (a common method used by malicious code), System Restore loses its recovery ability.

IBM Rescue and Recovery is another FSR solution that includes a set of self-recovery tools to help users recover from a software crash, even if the primary operating system will not boot (IBM). It creates a hidden partition on the hard disk to back up data and installs a refined Linux as an assistant operating system. The first backup is a base image, and subsequent backups are differential. It uses the assistant operating system as a recovery platform; even if the primary operating system fails to boot, it still has the ability to recover the system. However, it requires more hard disk space to save the assistant operating system and backup data.

FSR technology has been used in many Virtual Machine Monitors (VMM) (VMware Inc.; Microsoft; Lawton and Denney, 2003; Qemu cpu emulator). The VMM is responsible for managing the hardware and other resources below while providing an interface for virtual machine instances above. The interface below the VMM, called the platform-VMM interface, could include bare hardware (called direct-on-hardware VM), an operating system, or any combination of the two (called host-based VM). With the help of storage virtualization, which often exists in the form of virtual disks, it is easy to implement an undo feature in virtual machine environments. VMware (VMware Inc.) makes use of snapshots to provide an FSR feature. The snapshot captures the entire state of the virtual machine at the time it is taken. This includes the state of all the virtual machine's disks (in certain special purpose configurations, one or more of the virtual machine's disks can be excluded from the snapshot), the contents of the virtual machine's memory, and the virtual machine settings. When reverting to the snapshot, VMware discards all changes made to the virtual machine since the snapshot was taken and returns all these items to the state they were in at that time. Microsoft's Virtual PC (Microsoft) is a product very similar to what is offered by VMware Workstation. It is based on the Virtual Machine Monitor (VMM) architecture and lets the user create and configure one or more virtual machines. Apart from the features supported by VMware, it provides one distinguishing functionality: it maintains an undo disk that lets the user easily undo some previous operations on the hard disks of a VM. This enables easy data recovery and might come in handy in several circumstances (Nanda, 2005). Bochs is an open-source VMM; it provides a disk rollback function through undoable disks and a file named the redolog. Undoable disks are commitable/rollbackable disk images. An undoable disk is based on a read-only flat image, associated with a growing redolog that contains all changes (writes) made to the flat image content. The redolog is dynamically created at runtime if it does not previously exist. All writes
go to the redolog; reads are done from the redolog if previously written, or from the flat file otherwise. After a run, the redolog will still be present, so the changes are still visible the next time, and the redolog can be committed (merged) to the flat image with the tool bxcommit. The redolog can be rolled back (discarded) by simply deleting the redolog file; in this case, all changes to the disks are discarded and the disk reverts to its previous state (Lawton and Denney, 2003).

Unlike the traditional VMMs, Ventana (Pfaff et al., 2006), a virtualization aware distributed file system, provides the powerful versioning, security, and mobility properties of virtual disks, while overcoming their coarse-grained versioning and their opacity that frustrates cooperative sharing. This allows Ventana to support the rich usage models facilitated by virtual machines, while avoiding the security pitfalls, management difficulties, and usability problems that virtual disks suffer from.

This paper presents an FSR scheme (called VDE) which has the ability to roll the file system state back to the previously saved state quickly. The VDE is similar to Bochs in some ways, but it has the following advantages:

(1) It is independent and does not need a VMM to support it; thus, it has lower overhead and can be used in a native operating system.
(2) It has a very high recovery speed, and the recovery speed is related only to the disk capacity, not to the amount of changed data.
(3) It uses a block-based technique to protect the file system, and provides more protection than file-based solutions. It can be applied to protect disk images used as virtual disks in a VMM, physical partitions, and whole disks.
(4) It does not need an assistant operating system to recover the file system when the primary operating system fails to boot.
3. Data protection
There are many ways to protect data, and almost all of them are implemented through data redundancy. In most cases, data protection involves data backup and restore. In the following sections, we characterize several features of backup related to FSR: hardware- vs. software-based solutions; the usage of snapshots and copy-on-write mechanisms; full vs. incremental backups; and file-based vs. block-based schemes (Chervenak et al., 1998).
3.1. Hardware-based vs. software-based
In recent years, many technologies to protect data have been developed, such as RAID (providing RAID 0 through RAID 5), NAS (network attached storage) and SAN (storage area networks), etc., and they can be implemented in software, hardware, or both. For example, RAID can be implemented in hardware or software. Compared with hardware-based solutions, the main advantages of software-based solutions are their low price and their ability to be
updated easily. However, software-based solutions cost more system resources and may lower system performance. Hardware-based solutions offer several advantages over software-based ones: they do not hamper the existing system's ability to perform, and they have more robust fault-tolerance features.
3.2. Snapshots and copy-on-write (COW)
The widely used technology for both static and dynamic FSR is to create a snapshot: a frozen, read-only copy of the current state of the file system. The contents of the snapshot may then be copied to a backup device without danger of the file system hierarchy changing from subsequent accesses. The system can maintain any number of snapshots, thus providing read-only access to earlier versions of files and directories. A COW scheme is often used along with snapshots. Once a snapshot is created, any subsequent modifications to files or directories are applied to newly created copies of the original data. Blocks are copied only if they are modified, which conserves disk space. Both VMware and Virtual PC use snapshot and COW technologies to provide an FSR facility.
3.3. Full vs. incremental
The simplest way to protect a file system against file corruption is to copy the entire contents of the file system to a backup device. The resulting archive is called a full backup. If a file system is later corrupted for some reason, it can be reconstructed from the full backup. Full backups have two disadvantages: reading and writing the entire file system is slow, and storing a copy of the file system consumes significant capacity on the backup medium. Faster and smaller backups can be achieved by using an incremental backup scheme, which copies only those files that have been created or modified since a previous recovery point. With an incremental backup scheme, a computer system can be restored to a previous state quickly.
3.4. File-based vs. block-based
Files consist of logical blocks; these blocks, also called pages, are typically of fixed size in a given file system. Each logical file block is stored on a contiguous physical disk block; however, different logical blocks of a file may not be stored contiguously on disk. Backup software can operate either on files or on physical disk blocks. File-based backup systems understand file structure and copy entire files and directories to backup devices; the file is the basic unit of data backup. File-based schemes are more portable, since the backup contains complete files, and the notion of files is fairly universal. The disadvantage of file-based incremental backup schemes is that even a small change to a file requires the entire file to be backed up. By contrast, block-based (also called storage-based or device-based) backup systems ignore file structure when copying disk blocks onto the backup medium. This improves backup performance since the backup software performs fewer costly seek operations. The storage block is the basic unit of data
backup in block-based backup systems. To allow file recovery, block-based backups must include information on how files and directories are organized on disk, in order to correlate blocks on the backup medium with particular files. Thus, block-based programs are likely to be specific to a particular file system implementation and hardware configuration, and are less portable. Both Microsoft System Restore and IBM Rescue and Recovery are implemented using file-based incremental technologies.

The VDE makes use of a block-based scheme to implement file system protection. Unlike other block-based data protection systems, the VDE uses map-on-write (MOW) rather than the COW technique to protect a block. When a write operation is performed on a protected block, MOW simply maps the operation directly to another block without copying the original block data. By contrast, when a protected block is about to be changed, COW first copies the original block data to a newly allocated block, then applies the operation to that new block. The newly allocated block and the original block are independent under COW, but a mapped block under MOW depends on its original block if the write operation changes only part of the original block. Therefore, under MOW a mapped block must be used together with its original block; otherwise, MOW may not provide correct data for a changed block (the sketch below contrasts the two write paths). In the next section, we explain the VDE in detail. For the sake of description, we use the hard disk as a specific storage device, but the algorithms can also be applied to hard disk partitions or other storage devices.
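As an illustration of the difference, the two write paths can be contrasted with a small C sketch; the types and names here are our own, not taken from the VDE implementation:

#include <string.h>

#define SECT_SIZE     512
#define SECTS_PER_BLK 16

typedef struct { unsigned char s[SECTS_PER_BLK][SECT_SIZE]; } blk_t;

/* COW: the first write to a protected block copies the whole block;
 * afterwards the copy is fully independent of the original. */
void cow_write(const blk_t *orig, blk_t *copy, int sect, const void *buf)
{
    memcpy(copy, orig, sizeof *copy);      /* up-front copy cost   */
    memcpy(copy->s[sect], buf, SECT_SIZE); /* then apply the write */
}

/* MOW: only the written sector lands in the mapped block; a flag word
 * records which sectors are valid there, so reads of other sectors must
 * still fall back to the original block -- the mapping depends on it. */
void mow_write(blk_t *map, unsigned *flags, int sect, const void *buf)
{
    memcpy(map->s[sect], buf, SECT_SIZE);  /* no block data is copied */
    *flags |= 1u << sect;
}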
4. Description of the VDE
The VDE is a software layer between the storage device and the file system; it completely controls the storage devices and provides a virtual disk environment for the file system. It intercepts all read and write operations from the file system or from applications (meaning those operations that bypass the file system), and provides full protection for the file system. The VDE creates a separation between a 'read-only' base and a temporary writable overlay, which can either be made permanent or thrown away. The 'read-only' base is the disk state that was saved at a specified time; if the temporary writable overlay is thrown away, the file system rolls back to the 'read-only' base; conversely, if the temporary writable overlay is made permanent, it is merged into the 'read-only' base.

The principle of the VDE is based on a sector mapping mechanism. Its main idea is: if a protected sector is to be written, the VDE maps the write operation to another free sector; if a sector is to be read, the VDE checks whether the sector has been mapped or not; if it has been mapped, the VDE reads the mapped sector, otherwise the VDE reads the sector directly. To locate a sector quickly and use less disk space, the hard disk is divided into blocks of equal size according to logical block addressing (LBA). The block size can be changed as required; here we use 16 contiguous sectors as a block, so a hard disk is divided into N blocks. Two-level index tables are used to locate the positions of mapped and
free sectors; the first index table is called the block mapping table (BMT), and the second index table is called the mapping index table (MIT). To protect the hard disk data, some free disk space must be allocated to map the write operations. The free space can be obtained from the operating system by creating a temporary file. To improve performance, the free space allocated by the operating system should be as contiguous as possible, but this is not required. We call the allocated free space the mapping space (MP). The MP is divided into three sections: the first section is used to save the BMT table, the MIT table is located in the second section, and the last section is used for mapping blocks; a mapping block in the MP has the same size as a block on the hard disk.

The BMT table is used to trace the mapping states of the hard disk blocks; every bit in the BMT represents a hard disk block: a bit set to 1 means the corresponding hard disk block has been mapped, otherwise it has not been mapped. The MIT table is employed to register the mapping index and mapped sectors; every 32-bit word (this value can be changed as required) represents a hard disk block: the low 16 bits are the mapping index and the high 16 bits are the mapped flags of the sectors in a hard disk block. The mapping index is used to address the MP block that has been used to map the corresponding hard disk block. Every sector in a hard disk block has a corresponding bit in the mapped flag bits; a bit of 0 in the mapped flags means the sector in the hard disk block has not been mapped, otherwise it has been mapped. Fig. 1 illustrates the relationship between a hard disk block and the MIT. There is a one-to-one mapping between the block representations in both index tables and the hard disk blocks; the relationship between hard disk blocks and tables is illustrated in Fig. 2. Generally, to address a hard disk block, the VDE needs to look up both the BMT and MIT tables; however, if the block has not been mapped, the VDE only needs to look up the BMT table. In our design, the BMT table is smaller than the MIT table, its size being only 1/32 of the MIT table, and this design helps to locate a hard disk block quickly.
Fig. 1 – The relationship between a hard disk block and the MIT. One 32-bit unit of the MIT describes a hard disk block of 16 sectors: bits 0–15 hold the mapping index and bits 16–31 hold the mapped flags of the sectors.

Fig. 2 – The relationship between hard disk blocks, BMT and MIT. The N hard disk blocks Block_1, …, Block_N correspond one-to-one to Bit_1, …, Bit_N in the BMT and to Index_1, …, Index_N in the MIT.

4.1. Formulas used in algorithms

In the following equations, we assume one sector has 512 bytes, and the protected hard disk has N blocks. The addressing mode of the hard disk is treated as LBA mode; if the addressing mode of the hard disk is CHS mode, its address can be converted to an LBA value easily.

(1) Locate the offset in the BMT table
A sector with addressing value StartLBA is located in the nth block, where

n = BlockIndex = ⌊StartLBA/16⌋  (1)

For a block with index value BlockIndex, its byte offset in the BMT is:

ByteOffsetInBMT = ⌊BlockIndex/8⌋  (2)

The bit offset within that byte is:

BitOffsetInByte = BlockIndex mod 8  (3)

(2) Locate the offset in the MIT table
For a block with index value BlockIndex, its byte offset in the MIT is:

ByteOffsetInMIT = BlockIndex × 4  (4)

The sector index in the mapped flag bits is:

MITSectIndex = StartLBA mod 16  (5)

4.2. Algorithms

The VDE intercepts the read and write requests to the hard disk, and provides mapping services between the file system and the hard disk according to the following algorithms. Fig. 3 illustrates this procedure. The algorithms are described in the C language.

Fig. 3 – The read and write operations in the VDE.
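Eqs. (1)–(5) translate directly into code. As a rough illustration (the function names below are ours, not taken from the paper's listings):

#include <stdint.h>

#define SECTORS_PER_BLOCK 16u    /* one block = 16 sectors = 8 KB */

/* Eq. (1): index of the block containing sector StartLBA */
uint32_t block_index(uint32_t start_lba) { return start_lba / SECTORS_PER_BLOCK; }

/* Eqs. (2) and (3): byte and bit position of a block's flag in the BMT */
uint32_t bmt_byte_offset(uint32_t blk) { return blk / 8; }
uint32_t bmt_bit_offset(uint32_t blk)  { return blk % 8; }

/* Eq. (4): each MIT unit is a 32-bit word, i.e. 4 bytes per block */
uint32_t mit_byte_offset(uint32_t blk) { return blk * 4; }

/* Eq. (5): which of the 16 mapped-flag bits belongs to this sector */
uint32_t mit_sect_index(uint32_t start_lba) { return start_lba % SECTORS_PER_BLOCK; }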
(1) Write operation
When an application or file system writes a sector from memory to the hard disk, the VDE maps this operation to a sector
in the MP block. The following function Write() gives the algorithm to process this operation. If the application or file system writes more than one sector at a time, this procedure is simply repeated sector by sector. Here, we assume the sector address is StartLBA.
(2) Read operation
When an application or file system reads a sector from the hard disk into memory, the VDE checks whether the sector has been mapped or not; if it has been mapped, the VDE reads the data from the mapped sector; otherwise, it reads the data from the original sector. The following function Read() gives the algorithm to process this operation. If the application or file system reads more than one sector at a time, this procedure is simply repeated sector by sector. The GetBMTState(), GetMITState() and other parameters used in Read() are the same as in Write().
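As a self-contained illustration of the Write() and Read() logic just described, the following C sketch models both operations against in-memory arrays standing in for the hard disk and the MP. All identifiers and sizes are our own simplifications, not FlashBack's driver code:

#include <stdint.h>
#include <string.h>

/* Toy model of the mapping scheme: a small "disk" and mapping space (MP)
 * held in memory.  All sizes and names are illustrative only. */
#define SECT_SIZE     512
#define SECTS_PER_BLK 16
#define N_BLOCKS      64

static uint8_t  disk[N_BLOCKS * SECTS_PER_BLK][SECT_SIZE]; /* protected disk */
static uint8_t  mp  [N_BLOCKS * SECTS_PER_BLK][SECT_SIZE]; /* mapping space  */
static uint8_t  bmt[(N_BLOCKS + 7) / 8]; /* 1 bit per block, Eqs. (1)-(3)    */
static uint32_t mit[N_BLOCKS];           /* 32 bits per block, Eqs. (4), (5) */
static uint16_t next_free_mp;            /* next unused MP block             */

static int blk_mapped(uint32_t b) { return (bmt[b / 8] >> (b % 8)) & 1; }

void Write(uint32_t StartLBA, const void *buf)
{
    uint32_t b = StartLBA / SECTS_PER_BLK, s = StartLBA % SECTS_PER_BLK;
    if (!blk_mapped(b)) {            /* map-on-write: nothing is copied */
        bmt[b / 8] |= 1u << (b % 8); /* mark the block as mapped        */
        mit[b] = next_free_mp++;     /* low 16 bits: mapping index      */
    }
    memcpy(mp[(mit[b] & 0xFFFFu) * SECTS_PER_BLK + s], buf, SECT_SIZE);
    mit[b] |= 1u << (16 + s);        /* high 16 bits: sector is mapped  */
}

void Read(uint32_t StartLBA, void *buf)
{
    uint32_t b = StartLBA / SECTS_PER_BLK, s = StartLBA % SECTS_PER_BLK;
    if (blk_mapped(b) && (mit[b] & (1u << (16 + s))))
        memcpy(buf, mp[(mit[b] & 0xFFFFu) * SECTS_PER_BLK + s], SECT_SIZE);
    else
        memcpy(buf, disk[StartLBA], SECT_SIZE); /* unmapped: original data */
}

Note how Read() falls back to the original sector whenever the per-sector flag is clear: this is the dependence of a mapped block on its original block that distinguishes MOW from COW.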
(3) Recover and save disk state
It is very simple to recover the disk state in the VDE: all we need to do is clear all flags in the BMT and MIT tables. That is to say, the space of the BMT and MIT tables is set to 0. In this way, all mapped sectors that have been changed are left in an unmapped state, and these sectors revert to their original state. If we want to recover the disk state when the VDE program exits, we do not need to do anything to the BMT and MIT tables; we simply delete the temporary file used as the MP (this operation has no relation to recovery, it only releases disk space). The reason is: if the VDE program terminates, no program provides the mapping function, and the mapped sectors revert to their original state. This feature is very useful: it provides not only fast recovery but also the ability to recover the disk state even when the operating system crashes (hardware faults aside). The reason is: if the VDE
program cannot be run under this condition, the disk state reverts to its original state; by contrast, if the VDE program can execute under this condition, the user can instruct the VDE program to recover the system. To save the current disk state, what we need to do is copy the data from the MP blocks to the corresponding hard disk blocks for those sectors that have been mapped. This procedure may take minutes or more; the more sectors have been mapped, the more time is needed. The following function Save() gives the algorithm to save the current disk state.
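Continuing the toy model above, Save() and a trivial Recover() might look as follows; this is again our own sketch, not the FlashBack listing:

void Save(void)
{
    for (uint32_t b = 0; b < N_BLOCKS; b++) {
        if (!blk_mapped(b))
            continue;
        uint32_t idx = mit[b] & 0xFFFFu;
        for (uint32_t s = 0; s < SECTS_PER_BLK; s++)
            if (mit[b] & (1u << (16 + s)))  /* commit only mapped sectors */
                memcpy(disk[b * SECTS_PER_BLK + s],
                       mp[idx * SECTS_PER_BLK + s], SECT_SIZE);
    }
    memset(bmt, 0, sizeof bmt);  /* after committing, nothing is mapped */
    memset(mit, 0, sizeof mit);
    next_free_mp = 0;
}

void Recover(void)               /* clearing the tables is all it takes: */
{                                /* every mapped sector reverts at once  */
    memset(bmt, 0, sizeof bmt);
    memset(mit, 0, sizeof mit);
    next_free_mp = 0;
}

The asymmetry between the two functions is the source of the VDE's fast recovery: Save() must move data, while Recover() only zeroes two small tables.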
Because the VDE is a block-based scheme, it does not affect the file attribute information (date, owner, permissions, etc.) when it commits the mapped blocks to their original locations.
5. Implementation
To show that our ideas can be implemented in a practical and efficient way, we developed a VDE prototype called FlashBack on the Windows platform based on the presentation in this paper. We used Windows only as a test platform; our solution can also be applied to other platforms, such as Linux, Unix, etc. FlashBack is written in assembly and C. We developed it under Windows 2000 on x86 PCs, and it can also be applied to Windows XP and Windows Server 2003. Fig. 4 outlines FlashBack's architecture, boot flowchart, and user interface.

Fig. 4 – The implementation of the VDE on the Windows 2000 platform: (a) architecture, (b) boot flowchart, (c) user interface.

There are several run modes on x86 PCs, such as real mode, protected mode, and virtual-8086 mode; of these, real mode and protected mode are relevant to hard disk protection. As a design specification, x86 PCs start in real mode at boot time to ensure backwards compatibility with legacy operating systems. They must be manually switched into protected mode by a program before any protected mode features are available. In modern computers, this switch is usually one of the very first tasks performed by the operating system at boot time. To provide full hard disk protection without VMM support, FlashBack intercepts all disk operations in both real mode and protected mode.

When a computer is started or reset, it runs the power-on self test (POST). After the computer finishes the POST, the system BIOS (basic input/output system) starts an operating system from the master boot record (MBR) on the hard disk (here, we assume the BIOS is configured to boot from a hard disk). The MBR then scans the partition table for the system partition information. When the system partition information has been read, it loads the boot sector of the system partition into memory and starts it. On Windows 2000, the boot sector loads the startup file Ntldr, which loads the operating system files from the system partition. During these procedures, all disk operations are performed through Int 13h, provided by the BIOS. Therefore, FlashBack must hook Int 13h during Windows startup. To realize this, FlashBack changes the system
boot flow by replacing the original MBR with a new MBR which loads the VDE for real mode module. Fig. 4(b) illustrates the system boot flow under FlashBack. The VDE for real mode module is a BIOS-based program; it uses only the BIOS interface and does not call any API that the operating system provides. The VDE for real mode module is composed of two sections: one section is programmed as a TSR (terminate and stay resident) program and hooks Int 13h to implement the read and write operations according to the algorithms presented in this paper; the other section provides a user interface and other functions such as recovering and saving the disk state. The user interface can be activated by a hot key before the original MBR is loaded; Fig. 4(c) shows the user interface. FlashBack backs up the original MBR to a protected place, and maps all write operations targeting the MBR to the sector where the original MBR resides. The new MBR and the VDE for real mode module are protected by FlashBack itself; they cannot be destroyed except when the user boots the system from other storage media such as a floppy disk or CD. In this way, FlashBack retains the ability to recover the file system even when the system boot fails.

Windows 2000 has its own 32-bit disk driver; after it starts up, all disk operations are performed through this driver, and Int 13h is not called anymore. The Windows Driver Model (WDM) assumes that a hardware device can have several drivers, each of which contributes in some way to the successful management of the device (Microsoft Company, 2000). WDM accomplishes the layering of drivers by means of a stack of device objects. To intercept disk operations after Windows startup, we can program a disk filter driver and install it on the disk driver. The VDE for protected mode module in the flowchart is a hard disk filter driver, and its function is the same as that of the TSR section of the VDE for real mode module. In a word, whether Windows runs in real mode or protected mode, FlashBack intercepts all disk read and write operations, and implements the mapping operations according to the algorithms presented in this paper. Fig. 4(a) shows the architecture of FlashBack; the ''VDE filter layer'' in Fig. 4(a) indicates the modules of the VDE for real and protected modes.
6. Performance analysis
Compared with other data protection solutions, such as file-based and COW solutions, our solution has the following performance advantages: (1) Low overhead: our solution does not need to back up data; it just redirects read and write operations. The extra workload consists of the mapping operations, which take only a little CPU time; therefore, the performance overhead is small. On the contrary, the file-based and COW methods both need to back up data besides other additional operations; thus, they incur more performance overhead than our method. (2) Less disk space: compared with file-based schemes, our solution uses less disk space. The VDE uses a block-based incremental mapping scheme, so it avoids the disadvantage of file-based incremental backup schemes, in which even
a small change to a file requires the entire file to be backed up. However, the VDE takes more disk space and other system resources than the native environment. In the following sections, we give a basic analysis of the additional disk space and system resources used by the VDE.
6.1. Extra disk space
The VDE does not involve data backup operations; it only maps the disk write operations to the MP. Therefore, the main extra disk space occupied by the VDE is the disk space used for the BMT and MIT tables; as for the mapping blocks in the MP, they are used to save data just as in the native environment. The programs that realize the VDE take only a little disk space, which can be ignored. We assume one sector has 512 bytes and one block takes 16 sectors in the following calculation.

(1) BMT space
The BMT uses 1 bit to indicate a block, so one BMT sector can express 512 × 8 × 16 = 65,536 hard disk sectors; thus, the BMT takes about 1/65,536 = 0.0015 percent of the hard disk space.

(2) MIT space
The MIT uses 32 bits to indicate a block, so one MIT sector can express 512 × 8 × 16/32 = 2048 hard disk sectors; thus, the MIT takes about 1/2048 = 0.0488 percent of the hard disk space.

Summing (1) and (2), the additional disk space used by the VDE is about 0.0503 percent of the hard disk space.
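These percentages are easy to verify with a throwaway C check (our own snippet):

#include <stdio.h>

int main(void)
{
    /* one 512-byte table sector covers 512*8 BMT bits or 512*8/32 MIT words */
    double bmt = 1.0 / (512 * 8 * 16);      /* 1 bit per 16-sector block   */
    double mit = 1.0 / (512 * 8 * 16 / 32); /* one 32-bit word per block   */
    printf("BMT %.4f%%, MIT %.4f%%, total %.4f%%\n",
           100 * bmt, 100 * mit, 100 * (bmt + mit));
    return 0;
}

This prints BMT 0.0015%, MIT 0.0488%, total 0.0503%, matching the figures above.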
6.2. Extra system resources utilized
Compared with the native environment, the additional system resources used by the VDE are memory and CPU. The VDE uses CPU time to look up the BMT and MIT tables. As for memory, the VDE filter drivers need to reside in memory, which takes about 6 KB in our implementation; in addition, the VDE needs some memory to hold the BMT and MIT tables in order to reduce how often those tables are read from or written to the hard disk. The more memory used to hold the BMT and MIT tables, the faster the table lookups, and the lower the CPU utilization. However, the more memory the VDE uses, the less memory is left to the system, so there exists an optimal value. According to experimental results, when the allocated memory is about 1/10^5 of the hard disk space, the VDE has only a little influence on the system. FlashBack uses an LRU (least recently used) algorithm to exchange BMT and MIT tables between memory and disk, as sketched below. When FlashBack needs to read new data from the BMT or MIT table, it either writes the least recently used BMT or MIT table in memory to disk, or just discards it if it has not been modified.
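A minimal sketch of such a table cache with LRU replacement and dirty tracking follows; this is our own construction, since FlashBack's internals are not published in this detail:

#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 8
typedef struct {
    uint32_t sector;           /* which BMT/MIT sector is cached */
    uint8_t  data[512];
    int      valid, dirty;     /* dirty = modified since loading */
    uint64_t last_used;        /* for LRU victim selection       */
} slot_t;

static slot_t  cache[CACHE_SLOTS];
static uint64_t tick;

extern void disk_read(uint32_t sector, void *buf);        /* assumed I/O */
extern void disk_write(uint32_t sector, const void *buf); /* primitives  */

uint8_t *table_sector(uint32_t sector)
{
    slot_t *victim = &cache[0];
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].sector == sector) {
            cache[i].last_used = ++tick;           /* cache hit */
            return cache[i].data;
        }
        if (!cache[i].valid || cache[i].last_used < victim->last_used)
            victim = &cache[i];                    /* track LRU slot */
    }
    if (victim->valid && victim->dirty)            /* write back only if   */
        disk_write(victim->sector, victim->data);  /* modified, else drop  */
    disk_read(sector, victim->data);
    victim->sector = sector; victim->valid = 1;
    victim->dirty = 0; victim->last_used = ++tick;
    return victim->data;                           /* caller sets dirty    */
}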
6.3. Experiments and testing results
To validate our algorithms, we used FlashBack as our testing program and Iometer (Sourceforge) (version dated 2004.07.30) for the measurement and characterization of
native and VDE systems. Iometer is an I/O subsystem measurement and characterization tool for single and clustered systems. It is a standard industry benchmark and the sources are available, allowing us to study the code when needed. We created an Iometer configuration file that varied three parameters: block size for I/O operations, percentage of read operations, and degree of randomness. We collected the following data points in the parameter space:

BlockSize ∈ {1K, 4K, 8K, 16K} × %Read ∈ {0, 25, 50, 75, 100} × %Random ∈ {0, 50, 100}
We then presented and compared these data on the native and VDE systems, and the results of experiments show that the VDE has little performance overhead. The experiments were performed on a PC with a 1200 MHz Intel Celeron CPU, 256 MB RAM, Windows 2000 with service pack 4 and a 60 GB Seagate IDE hard disk (ST360021) which supports ultra100. Identical configurations were used for each experiment and the same physical disk was used to compare the native and VDE throughputs and CPU utilizations. Each Iometer experiment was set up with one worker, and each data point was run for 3 min to give the output values sufficient time to stabilize. Finally, in each case Iometer was configured to access the disk through the file system rather than directly through a raw or physical device interface.
Figs. 5 and 6 show the Iometer results. The throughputs and CPU utilizations obtained with the Iometer micro-benchmarks are characteristic of the combination of hardware and workload. In order to explain the figures, we consider the way Iometer generates I/Os. As the %Read parameter is varied, Iometer computes the probability that the next operation is a read as p(Read) = %Read/100, and the probability that it should seek to a random offset as p(Seek) = %Random/100. For example, for (%Read = 50, %Random = 0, BlockSize = 8 KB), Iometer will perform completely sequential I/O on 8 KB blocks where roughly half the operations will be reads (Ahmad et al., 2003). This helps explain the throughputs when %Random = 0 (sequential I/O). When Iometer performs sequential I/O, the throughput peaks at 100% reads and at 100% writes (%Read = 0), but mixed (0 < %Read < 100) reads and writes appear to suffer. Multiple factors account for this behavior: peculiarities of the workload generator itself, or caching behavior for reads vs. writes. In the sequential case of mixed reads and writes, Iometer does sequential I/O accesses and chooses for each transaction whether it will be a read or a write based on the access specification provided to it. In other words, the reads and writes are not two separate sequential streams. Depending upon how disk controllers and/or drives reorder accesses or perform caching optimizations, this can affect the throughput significantly.
Fig. 5 – The VDE and native disk throughputs. The graphs plot the disk throughput for different percentages of reads and degrees of randomness for four different block sizes: 1 KB, 4 KB, 8 KB and 16 KB, respectively.
Fig. 6 – The VDE and native CPU utilizations. The graphs plot the CPU utilization for different percentages of reads and degrees of randomness for four different block sizes: 1 KB, 4 KB, 8 KB and 16 KB, respectively.
We believe that the read-ahead and write caching algorithms might interfere in this case, resulting in lower throughput. In theory, the VDE throughput should be lower than the native throughput under the same conditions, because the VDE needs to look up tables. However, we can find some unusual points in the figures where the VDE throughput is higher than native, and these unusual points increase with the degree of randomness and the block size. There are two reasons for this phenomenon: first, the VDE uses 8K as a block, so it is able to provide a cache for some small block read and write operations; second, the VDE can change some non-contiguous read and write operations into contiguous operations. In Fig. 6, we find a special phenomenon: when %Random = 0 and %Read = 0 (sequential write operation), the CPU utilizations differ markedly for all block sizes. The reason is that the VDE maps all write operations to the MP; it not only needs to look up both the BMT and MIT tables, but also needs to exchange these tables between memory and disk, and thus it takes more CPU time. By contrast, when Iometer does read operations, in some situations, for instance when no sectors have been mapped, the VDE only needs to look up the BMT table, and thus it takes less CPU time. As for the special points when BlockSize = 1K, they can still be explained by the caching the VDE provides, just as for throughput. Even though the throughput and CPU utilization have unusual characteristics, these are inherent to the combination
of the hardware and the workload generator, and thus do not affect the ability of micro-benchmarks to model application performance.
7. Conclusions
Our society is becoming more and more dependent on computer systems, which nowadays are used in everyday life, from business to banking, from entertainment to healthcare. On the other hand, there are more and more threats to computer systems, such as malicious code and misuse. FSR provides protection for the file system, and it offers a convenient approach to recovering a system quickly in case of corruption. FSR is becoming increasingly common in the computing world; the degree of data redundancy and the recovery speed are important considerations in its deployment. In this paper, we presented a highly efficient VDE to implement FSR. The VDE is very suitable for those fields where no data need to be saved, such as PC terminals, Internet cafés and other similar situations. Besides these fields, the VDE can also be used in daily work where users' data must be saved. This can be achieved through a simple configuration; for instance, we can use the VDE to protect the primary partition, where only the operating system and applications reside, while users' data are saved to other, unprotected partitions.
Acknowledgements

We would like to thank the anonymous reviewers for their comments and suggestions.
references
Ahmad I, Anderson JM, et al. An analysis of disk performance in VMware ESX server virtual machines. In: Proceedings of the sixth annual workshop on workload characterization, Austin, Texas; October 27, 2003.
Archer JE, Conway R, Schneider FB. User recovery and reversal in interactive systems. ACM Transactions on Programming Languages and Systems 1984;6:1–19.
Chervenak AL, Vellanki V, Kurmas Z. Protecting file systems: a survey of backup techniques. In: Proceedings of the joint NASA and IEEE mass storage conference; March 1998.
David A, Lomet B. High speed on-line backup when using logical log operations. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, May 16–18, 2000, Dallas, Texas, USA.
IBM, .
Kim W, Whang K, Lee Y, Kim S. A recovery method supporting user-interactive undo in database management systems. Information Sciences March 1999;114(1/4):237–53.
Lawton K, Denney B. Bochs x86 PC emulator user manual, http://bochs.sourceforge.net/; 2003.
Microsoft Company. Windows 2000 driver design guide. Microsoft Press; 2000.
Microsoft Corporation, .
Microsoft, .
Nanda S. A survey on virtualization technologies. RPE report, http://www.ecsl.cs.sunysb.edu/tech_reports.html; February 2005.
Pei D, Wang DS, Zheng WM. Design and implementation of a low-overhead file checkpointing approach. In: IEEE international conference on high performance computing in the Asia–Pacific region (HPC-Asia); 2000.
Pfaff Ben, Garfinkel Tal, Rosenblum Mendel. Virtualization aware file systems: getting beyond the limitations of virtual disks. In: Third symposium on networked systems design and implementation (NSDI); May 2006.
Piernas Juan, Cortes Toni, García José M. DualFS: toward a new journaling file system. Jornadas de Paralelismo; September 2002.
Prakash A, Knister MJ. A framework for undoing actions in collaborative systems. ACM Transactions on Computer–Human Interaction (TOCHI) December 1994;1(4):295–330.
Qemu cpu emulator, ; 2004.
Seltzer Margo I, Ganger Gregory R, McKusick M Kirk, Smith Keith A, Soules Craig AN, Stein Christopher A. Journaling versus soft updates: asynchronous meta-data protection in file systems. In: USENIX annual technical conference (San Diego, CA, 18–23 June 2000); 2000. p. 71–84.
Sourceforge, .
Stein C, Howard J, Seltzer M. Unifying file system protection. In: Proceedings of the 2001 USENIX technical conference, Boston, MA; June 2001. p. 79–90.
Symantec, .
VMware Inc. VMware virtual machine technology, http://www.vmware.com/; September 2000.
Washizaki H, Fukazawa Y. Dynamic hierarchical undo facility in a fine-grained component environment. In: CRPITS '02: proceedings of the 40th international conference on tools Pacific, Sydney, Australia; 2002, ISBN 0-909925-88-7. p. 191–9.
Zhu Ningning, Chiueh Tzi-cker. Design, implementation and evaluation of repairable file service. In: The international conference on dependable systems and networks, San Francisco, CA; June 22–25, 2003.
Jinqian Liang ([email protected]. edu.cn) received his B.S. degree in mechanical and electronic engineering from Tianjin University, Tianjin, China, in 1991 and his M.S. degree in control theory and engineering from Tsinghua University in 2001. He is currently a Ph.D. candidate at the Center for Intelligent and Networked Systems, Tsinghua University, Beijing, China. His research interests currently focus on computer security.

Xiaohong Guan ([email protected]) received his B.S. and M.S. degrees in control engineering from Tsinghua University, Beijing, China, in 1982 and 1985, respectively, and his Ph.D. degree in electrical engineering from the University of Connecticut in 1993. He was a senior consulting engineer with PG&E from 1993 to 1995. He visited the Division of Engineering and Applied Science, Harvard University from Jan. 1999 to Feb. 2000. From 1985 to 1988 and since 1995 he has been with the Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China, and currently he is the Cheung Kong Professor of Systems Engineering and Director of the National Lab for Manufacturing Systems. He is currently the Chair of the Department of Automation and Director of the Center for Intelligent and Networked Systems, Tsinghua University, China. His research interests include computer network security, wireless sensor networks, and the economics and security of complex networked systems.
computers & security 25 (2006) 600–615
Wavelet based Denial-of-Service detection

Glenn Carl a, Richard R. Brooks b,*, Suresh Rai c
a Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16801, USA
b Holcombe Department of Electrical and Computer Engineering, Clemson University, 313-C Riggs Hall, P.O. Box 340915, Clemson, SC 29634-0915, USA
c EE Department, LSU, Baton Rouge, LA 70803, USA
article info

Article history: Received 17 May 2005; Revised 30 June 2006; Accepted 3 August 2006

Keywords: CUSUM; DDoS/DoS; Network security; Performance testing; Haar transform

abstract

Network Denial-of-Service (DoS) attacks that disable network services by flooding them with spurious packets are on the rise. Criminals with large networks (botnets) of compromised nodes (zombies) use the threat of DoS attacks to extort legitimate companies. To fight these threats and ensure network reliability, early detection of these attacks is critical. Many methods have been developed with limited success to date. This paper presents an approach that identifies change points in the time series of network packet arrival rates. The proposed process has two stages: (i) statistical analysis that finds the rate of increase of network traffic, and (ii) wavelet analysis of the network statistics that quickly detects the sudden increases in packet arrival rates characteristic of botnet attacks.

Most intrusion detection systems are tested using data sets from special security testing configurations, which leads to unacceptable false positive rates when they are used in the real world. We test our approach using data from both network simulations and a large operational network. The true and false positive detection rates are determined for both data sets, and receiver operating curves use these rates to find optimal parameters for our approach. Evaluation using operational data proves the effectiveness of our approach.

© 2006 Elsevier Ltd. All rights reserved.
1. Introduction
Network attackers have many goals. These include thrills for ‘script kiddies’ or intentional damage for employees, criminals, or political organizations. It is unlikely that exploitation of software errors and system misconfiguration will ever stop. The increasing number of always-on broadband connections administered by untrained users is likely to further this trend. Intrusion detection systems (IDS) identify and report security violations. Early detection supports attack mitigation before a system and its users are affected. Many commercial
IDS are available. Newman et al. compared many of these commercial products and stated: ''One thing that can be said with certainty about network-based intrusion-detection systems is that they're guaranteed to detect and consume all your available bandwidth. … Because no product distinguished itself, we are not naming a winner.'' This result expresses the danger of using systems with high false positive rates. Anyone familiar with Aesop's fable of ''The Boy Who Cried Wolf'' knows that systems with high false positive rates consume system resources when no
* Corresponding author. Tel.: +1 864 656 0920; fax: +1 864 656 1347. E-mail addresses: [email protected], [email protected] (R.R. Brooks), [email protected] (S. Rai). 1 Tel.: +1 225 578 4832; fax: +1 225 578 5200.
attacks exist, and when real attacks occur, alarms are almost certain to be ignored. This paper presents a new algorithm for detecting packet flooding Denial-of-Service (DoS) intrusions. For background on DoS IDS, readers are directed to the current survey in Carl et al. (2006). Our approach combines the benefits of both statistical and wavelet analysis techniques. Statistical methods detect features that are indicative of DoS attacks in the network traffic patterns. If properly tuned, statistical approaches can have relatively low false positive rates, but (as shown in Section 8) the time required to signal that a DoS attack is underway can be unacceptably long. Attacks are signaled only after the damage has been done and users have been inconvenienced. The use of wavelets, on the other hand, can detect attacks quickly, but often at the cost of higher false positive rates. In Section 8 we show that, by combining both techniques, detection efficiency is better than with a pure wavelet approach and detection delay is lower than with purely statistical techniques. Our method can be tuned to fit the needs of a specific installation.

Another major issue with current intrusion detection approaches is the lack of effective testing. It is difficult to find operational networks for intrusion detection testing, since almost every enterprise is unwilling to tolerate the effects of a live attack on its network. The researcher can at best set up detection software on the network and hope an attack will occur. Scientific analysis of these results is almost impossible since the researcher has no reliable ground truth for comparison. Most researchers resort to testing their solutions using either network simulators like ns-2 (VINT Project), testbeds constructed for network intrusion research (McHugh, 2000), or honeypots. We will see in Section 8 that the background traffic in all of these systems is nowhere near as chaotic as the traffic in operational networks. Tests that do not use operational networks almost always produce artificially low false positive rates. A fuller discussion of these issues can be found in Brooks (2005). The survey in Carl et al. (2006) focuses specifically on the testing of DoS detection software. In Sections 5–8 of this paper we present a new test methodology for validating DoS detection software. It has the strong points of both network simulation (or specialized test facility) testing and of using operational networks. In this approach, we use background traffic collected from a typical corporate setting and merge it with network DoS attack traffic data.

The contributions of this paper are twofold. First, we present a new method for detecting network flooding DoS attacks. The approach has lower false positive rates and shorter reaction times than competing approaches. Second, we evaluate our approach using techniques that insert attack data into traffic sets containing realistic background traffic. This testing approach goes beyond the current art and suggests ways that researchers can provide believable test results for intrusion detection systems.

The rest of the paper is organized as follows. Section 2 provides background on Denial-of-Service (DoS) attacks. Section 3 reviews previous work. Section 4 describes our wavelet-based DoS detection method. Sections 5–7 detail test data generation, detection method implementation, and testing procedures, respectively. Section 8 presents detection results. Finally, Section 9 discusses our results and concludes the paper.
2. Denial-of-Service (DoS) attacks
Denial-of-Service (DoS) attacks attempt to prevent legitimate use of a networked resource. Their target can be an entire network or an individual machine, and many kinds of attacks exist. UDP flood attacks saturate the network links and queues en route, starving the target of legitimate traffic through congestion. TCP SYN flood attacks consume end-system resources with numerous TCP connection requests (Ricciuli et al., 1999): malicious nodes initiate, but do not complete, the three-way handshake TCP uses to establish connections. This leaves many stale TCP connection requests on the victim; each half-open connection is never completed and continues to consume end-system resources until it eventually times out. During both attacks (UDP and TCP SYN flood), legitimate users are 'denied service.' This paper considers only DoS flooding attacks that induce network congestion; a TCP SYN flood targets queue management on the node and is outside the scope of this paper.

To increase attack effectiveness, Distributed Denial-of-Service (DDoS) flooding attacks are becoming increasingly common. A DDoS flooding attack starts with an attacker compromising individual hosts. The attacker inserts a 'zombie' process on each node that can be triggered remotely to send large volumes of network traffic to a victim node. The zombie hosts are typically aggregated into a large 'botnet' that the intruder commands using attack software, such as the Stacheldraht tool. These botnets can be massive, some containing over a million nodes. Frequently, botnets are used to threaten legitimate enterprises with DoS attacks and extort money (Information Week, 2005; MSNBC, 2004). When launched, the DDoS attack slows down or crashes the target, which has no advance warning until the attack is initiated.
3. Previous work
A number of DoS attack studies exist (Ricciuli et al., 1999; Paxson, 1999; Spatscheck and Petersen, 1999; Vigna and Kemmerer, 1999; Dittrich, 2000; Strother, 2000). Most of these studies address vulnerabilities or possible countermeasures; few focus on attack detection. Xu and Lee (2003) isolate and protect web servers against DDoS attacks. They address the attack and countermeasure issue using a game-theoretic framework that models the system's performance as the minimax solution between conflicting goals, i.e. their solution configures the server in the way that provides the best possible service while under attack. More recent reports (Anderson et al., 2004; Lakshminarayanan et al., 2004; Kreibich and Crowcroft, 2004) look at ways of preventing and constraining DoS attacks. Anderson et al. (2004) rely on a 'send-permission token' to restrict DoS attacks. Lakshminarayanan et al. (2004) have a host insert a packet filter at the last-hop IP router. The 'Honeycomb' technique (Kreibich and Crowcroft, 2004) uses a decoy computer, pattern-matching techniques, and protocol conformance checks to create intrusion detection signatures. Recently a number of companies, such as Tipping Point, have been commercializing intrusion prevention systems. Unlike
intrusion detection systems, which are typically software solutions, intrusion prevention systems use custom hardware situated inside the corporate firewall. They perform stateful monitoring of all data streams passing through the hardware and remove packets that violate protocols. Since these are commercial offerings, their technologies have not been subject to peer review, and critical comparisons of their performance are difficult to find. The marketing information that is available contrasts so sharply with independent evaluations, like the one in Newman et al., as to be suspect.

Attack detection research is limited. In Moore et al. (2001), IP traffic is monitored for unsolicited 'backscatter' packets. Backscatter packets are a non-collocated victim's response to specific attacks (e.g. TCP SYN flood, closed-port probe). This approach quantified global DoS activity at around 12,000 attacks over a three-week period. Allen and Marin (2004) use estimates of the Hurst parameter to identify attacks, which cause a decrease in the traffic's self-similarity. This requires statistics of network traffic self-similarity before the attack. Network traffic is self-similar, in that it has been shown to be fractal (multi-fractal) at coarse (fine) scales (Leland et al., 1993; Abry et al., 2002). The high-speed core network is not self-similar, since traffic in that domain tends to follow a Poisson distribution (Cao et al.). The approach in Allen and Marin (2004) is made more difficult by the lack of a widely accepted metric for computing the fractal dimension (self-similarity metric) of network traffic dynamics.

Two statistical methods of analyzing network traffic to find DoS attacks are provided in Feinstein et al. (2003). One monitors the entropy of the source addresses found in packet headers, while the other monitors the average traffic rates of the 'most' active addresses. In Barford et al. (2002), wavelets are used to spectrally separate time-localized anomalous 'signals' from normal traffic 'noise.' In Wang et al. (2002) and Blazek et al. (2001), change-point detection with a cumulative sum algorithm is used to identify traffic changes.

Our method builds on the approach in Blazek et al. (2001). Our use of wavelets is similar to Barford et al. (2002), but we use wavelets for detecting change-points in the cumulative sum statistic and not for traffic signal decomposition. Section 4 describes our approach in detail and also provides insights into its precursors (Barford et al., 2002; Blazek et al., 2001).
4. DoS detection

IDS are classified as either signature or anomaly detection systems. Signature detection systems observe and compare patterns with known templates; if there is confidence in a template match, a known security breach is declared. Signature systems can be limited by the size and maturity of the template database. Most current virus scanners are template based. The drawback to this approach is that it can only detect known malware implementations; it is typically unable to identify brand new exploits. Anomaly detection systems compare observed patterns against a nominal model, and security violations are declared when observed patterns deviate from the expected model.

Our approach is an anomaly detection system. Anomaly detection systems can detect new attacks by observing their effect on the running system, but they also tend to have higher false alarm rates than template-based systems. Many traffic characteristics can be measured: rate, protocol, packet sizes, source/destination addresses, and others. For a TCP SYN attack, a system queue (or the number of half-open connections) can be measured; however, protocol modifications have addressed TCP SYN attacks, making them unlikely, so detecting these attacks is not very interesting. For UDP flood attacks, the number of packets sensed at a monitoring location can be used to detect a DoS attack. Our approach for detecting packet flooding DoS attacks counts the number of incoming packets in uniform time intervals. This is applicable anywhere in the network. At the target or en route, a DoS attack is identified by looking for an abrupt, positive change in the number of incoming packets. Slow, ramping DoS attacks may avoid detection by our approach, but they are also less deadly, since their slow onset gives network administrators time to detect them through normal network performance analysis.

4.1. Change-point detection

DoS attacks affect the statistical properties (e.g. mean, variance) of the arrival traffic, but with unknown magnitude and temporal variation. Target proximity, the number of zombies, and attack coordination are factors that modify these properties. We monitor the mean and variance of network traffic volume to identify the onset of a DoS attack. We assume that the mean of the baseline traffic varies smoothly, and that the change due to attack traffic is greater than normal changes in the baseline. Note that an effective attack will require a large increase in traffic volume in most realistic scenarios.²

Any network link or end-system node can be monitored for DoS attack traffic. Let Npt(k) denote a time-series representing the number of recorded packets during discrete time interval k. The non-overlapping time intervals are of uniform length I; therefore Npt(k) can be considered the arrival packet rate during interval k. Interval length I is chosen experimentally such that the monitored series Npt(k) is not smeared and also does not contain large fluctuations. If a DoS attack begins at time l, the packet time-series Npt(k) will be characterized by a significant increase for $k \geq l$.
² It does occur, though, that nodes are occasionally overwhelmed by legitimate traffic. These are sometimes known as 'flash events' or the 'Slashdot effect,' and occur when sites suddenly become overwhelmingly popular with their legitimate user community. For example, a new Linux distribution may be released, or a notice about an item may be posted on a popular website. This leads to a sudden increase in traffic that the server is unprepared to handle. It is almost impossible to tell the difference between flash events and DoS attacks by botnets, since the operational effect is the same: the server is overwhelmed by traffic and unable to perform its work. We feel it is unrealistic to expect a network monitor to tell the two phenomena apart, since the difference lies in the intentions of the users and not in the traffic generated.
Let Npt(k) be the superposition of normal (or baseline) traffic $N^0_{pt}(k)$ and attack traffic $N^1_{pt}(k)$:

$$N_{pt}(k) = \begin{cases} N^0_{pt}(k), & 0 < k < l \\ N^0_{pt}(k) + N^1_{pt}(k), & k \geq l \end{cases} \tag{1}$$

Assuming $N^0_{pt}(k)$ and $N^1_{pt}(k)$ are independent, the average rate m(k) of Npt(k) is the sum of the individual traffic means $m^0(k)$ and $m^1(k)$:

$$m(k) = \begin{cases} m^0(k), & 0 < k < l \\ m^0(k) + m^1(k), & k \geq l \end{cases} \tag{2}$$

Due to the DoS attack, the average arrival rate m(k) will increase for $k \geq l$. To identify this increase, change-point detection based on a cumulative sum (CUSUM) statistic is used. The CUSUM algorithm can be described as repeated testing of the null and alternative hypotheses (Basseville and Nikiforov, 1993) through a sequential probability ratio test (SPRT). For the time-series Npt(k), let $P_0(N_{pt}(k))$ and $P_1(N_{pt}(k))$ denote the probability of no-change and change, respectively. The SPRT process is then

$$S_{sprt}(k) = \left[ S_{sprt}(k-1) + \log \frac{P_1(N_{pt}(k))}{P_0(N_{pt}(k))} \right]^{+}, \quad S_{sprt}(0) = 0 \tag{3}$$

where $(x)^{+} = \max(0, x)$. The decision rule accepts the null hypothesis (no-change) when $S_{sprt}(k) \leq h$, where h is a positive number. Conversely, when $S_{sprt}(k) > h$, the alternative hypothesis (change) is accepted. If $\gamma$ is a given false alarm rate, then the optimal threshold h is given by:

$$P_0\left(S_{sprt}(k) > h\right) = \gamma \tag{4}$$
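To make the SPRT recursion of Eq. (3) concrete, the sketch below implements it for the idealized case of i.i.d. Gaussian observations with known pre-change and post-change means; as discussed next, these distributional assumptions do not hold for real traffic. The function name and the Gaussian model are illustrative assumptions, not part of the deployed detector.

```python
def sprt_cusum(x, mu0, mu1, sigma, h):
    """Parametric CUSUM of Eq. (3): repeated SPRT on the log-likelihood
    ratio, assuming i.i.d. Gaussian samples with known pre-change mean
    mu0 and post-change mean mu1 (common std. dev. sigma).

    Returns the first index k where S(k) > h (change declared), or None.
    """
    s = 0.0
    for k, xk in enumerate(x):
        # log P1(x_k)/P0(x_k) for equal-variance Gaussians
        llr = ((xk - mu0) ** 2 - (xk - mu1) ** 2) / (2.0 * sigma ** 2)
        s = max(0.0, s + llr)  # the (.)^+ operator of Eq. (3)
        if s > h:
            return k
    return None
```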
The CUSUM algorithm in Eq. (3) is optimal when the time-series Npt(k)'s pre-change and post-change distributions are independent and identically distributed. Since no model of the arrival process Npt(k) is available, the i.i.d. assumption does not hold for Eq. (1). Furthermore, the pre-attack and post-attack distributions, $P_0(N_{pt}(k))$ and $P_1(N_{pt}(k))$, cannot be determined, so the CUSUM of Eq. (3) cannot be applied directly to the time-series Npt(k) to detect a DoS attack. Blazek et al. addressed this issue by proposing the following non-parametric CUSUM-based method (Blazek et al., 2001):

$$S(k) = \left[ S(k-1) + N_{pt}(k) - m(k) \right]^{+}, \quad S(0) = 0 \tag{5}$$

where m(k) is the estimated average packet rate given by the recursion, for $0 \leq e < 1$:

$$m(k) = e\,m(k-1) + (1-e)\,N_{pt}(k), \quad m(0) = 0 \tag{6}$$

S(k) is sensitive to changes in the input Npt(k). Before a change, S(k) is a zero-mean process with bounded variance. After a change, S(k) becomes a positive process with growing variance due to the added attack traffic. At each k, S(k) is the cumulative sum of the difference between the current and (estimated) average packet rates; S(k) increases (decreases) when the packet rate increases (decreases) more quickly than the average rate. The non-linear operator $(\cdot)^{+}$ limits S(k) to positive values, since decreasing arrival rates are of no significance in detecting a DoS attack. To reduce high-frequency noise in the arrival process, Npt(k) is low-pass filtered using a windowed averaging technique, for $0 < a \leq 1$:

$$\tilde{N}_{pt}(k) = a\,N_{pt}(k) + (1-a)\,\tilde{N}_{pt}(k-1), \quad \tilde{N}_{pt}(0) = 0 \tag{7}$$

Substituting Eq. (7) for Npt(k) in S(k) leads to the modified CUSUM algorithm:

$$\tilde{S}(k) = \left[ \tilde{S}(k-1) + \tilde{N}_{pt}(k) - m(k) - c \right]^{+}, \quad \tilde{S}(0) = 0 \tag{8}$$

where $\tilde{N}_{pt}(k)$ is the windowed average of Npt(k) and $c = c_e\,m(k)$ is a correction factor. Parameter $c_e$ reduces the variance contributed by noise and slowly varying traffic by removing their contribution to the CUSUM; c also decays $\tilde{S}(k)$ back toward zero during periods of normal traffic activity. Parameters e and (1 − a) determine the amount of past history captured in m(k) and $\tilde{N}_{pt}(k)$, respectively; large values of e and (1 − a) imply long-memory dependence. The values of e, a, and $c_e$ are determined experimentally in Section 8. The CUSUM statistic follows the incoming packet rate increase due to a DoS attack (Fig. 1b); when the CUSUM exceeds an appropriate threshold, a DoS attack is detected (Blazek et al., 2001).
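The following sketch shows how the recursions of Eqs. (6)–(8) compose in code. The default parameter values are the static settings quoted in the Section 8 figure captions, and the function name is our own.

```python
def modified_cusum(npt, e=0.988, a=0.97, ce=0.20):
    """Non-parametric modified CUSUM of Eqs. (6)-(8).

    npt -- iterable of packet counts per interval, Npt(k).
    Returns the list of CUSUM values S~(k).
    """
    m = 0.0      # estimated average packet rate m(k), Eq. (6)
    n_avg = 0.0  # windowed average N~pt(k), Eq. (7)
    s = 0.0      # CUSUM statistic S~(k), Eq. (8)
    out = []
    for x in npt:
        m = e * m + (1.0 - e) * x
        n_avg = a * x + (1.0 - a) * n_avg
        c = ce * m                       # correction factor c = ce * m(k)
        s = max(0.0, s + n_avg - m - c)  # the (.)^+ operator
        out.append(s)
    return out
```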
Fig. 1 – (a) ns-2 packet time series with a DoS attack starting at $k = 6 \times 10^4$; (b) CUSUM statistic $\tilde{S}(k)$; (c) 6th-level wavelet decomposition coefficient (d6,0) of the CUSUM statistic $\tilde{S}(k)$.

4.2. Wavelet analysis of CUSUM statistic

DoS threshold settings and false alarm rates for a CUSUM detection approach are provided in Blazek et al. (2001). To lower false alarm rates and increase the true detection rate, our approach uses wavelet analysis of the modified CUSUM statistic of Eq. (8). Wavelet analysis describes input signals in terms of frequency components rather than statistical measures. Wavelets provide a concurrent time and frequency description, which can determine the time at which certain frequency components are present. These abilities lower detection time and increase accuracy, due to time localization and filtering capabilities, respectively. Fourier analysis, in contrast, indicates only whether certain frequency components are present, and is better suited to periodic signals whose frequency content is stationary. The property of vanishing moments gives some wavelets the ability to suppress polynomial signals. Packet time-series contain many traffic dynamics, most of which are non-periodic or periodic over time-varying intervals, so wavelets are well suited to the analysis of packet time-series.

Wavelets are oscillating functions defined over small intervals of time. The family of discrete (orthogonal) wavelet functions used for analysis is

$$\Psi_{j,l}(k) = 2^{-j/2}\,\Psi\!\left(2^{-j}(k - l)\right), \quad l, j \in \mathbb{Z} \tag{9}$$
where j is the scaling index, l is the time translation index, k is the time interval, and $\Psi$ is the mother wavelet. We used the Haar mother wavelet for its simplicity. Scale j is bounded by $2^j \leq M$, where M is the number of signal samples available at k. Wavelet $\Psi_{j,l}(k)$ is a zero-mean bandpass function of scale j, localized at time interval k. The effect of j is to scale the time window over which the mother wavelet $\Psi$ is defined; in the frequency domain, this rescaling correspondingly stretches and shifts the wavelet spectrum (Jaffard, 2001). The higher the value of the scaling (wavelet decomposition level, WDL) j, the smaller the widths of the bandpass filters, resulting in finer frequency analysis. Adjusting the parameters j and l determines the spectral range of the wavelet filter $\Psi_{j,l}(k)$.

Wavelet analysis is performed through the discrete wavelet transform (DWT). For a signal X(k), the DWT is defined through the inner product of the signal and a wavelet:

$$\mathrm{DWT}_X(j, l) = d_{j,l}(k) = \left\langle X(k), \Psi_{j,l}(k) \right\rangle \tag{10}$$

The values $d_{j,l}(k)$ are the discrete wavelet coefficients of X(k) at time position l and scale j, where $l = 0, \ldots, 2^{j} - 1$; they capture the localized energy present in the wavelet's bandpass range. If the coefficients contain 'enough' energy in a spectral range of interest, then a signal composed of these frequency components is declared as being present.

Haar wavelets calculate their coefficients iteratively through averaging and differencing of two data values. From these two operations, low-pass (average) and high-pass (difference) signal characteristics are stored in the Haar wavelet coefficients (Nievergelt, 1999). Depending on the scale j, the data values are either from the input signal (j = 1) or previously obtained wavelet coefficient values (j > 1). As an example, the high-pass filter coefficient $d_{j,0}$ is the difference between the two coefficients containing the average signal value over $[k, k + 2^{j-1}]$ and $[k + 2^{j-1}, k + 2^{j}]$, respectively. Coefficient $d_{j,0}$ thus denotes the average amount of signal change over a time window of $2^{j}$ samples.

Wavelet analysis can separate a time-varying signal of interest from 'noise' through localized filtering. At each time instance, wavelet analysis captures the energy of the active frequency components in multiple, non-overlapping spectral windows. Ideally, the signal and noise components will be captured in separate spectral windows. In the detection method of Barford et al. (2002), wavelet analysis is used to separate a DoS 'signal', $N^1_{pt}(k)$, from background traffic 'noise', $N^0_{pt}(k)$, both of which are present in the input time-series Npt(k) of Eq. (1). The DoS signal $N^1_{pt}(k)$ is an abrupt, positively increasing signal. Abrupt signal changes and noise contain most of their energy in high-frequency components. For the packet time-series Npt(k), DoS attack detection by direct wavelet analysis may therefore be susceptible to high failure rates: if Npt(k)'s high-frequency noise is nontrivial, the (DoS) signal may not be detectable through the coefficients, due to a low signal-to-noise ratio (Wang, 1995), and analysis of the coefficients' values may lead to high false alarm rates or missed detections. Thus, to improve detection efficiency, change-point detection via wavelet analysis of the modified CUSUM statistic, $\tilde{S}(k)$, is performed. In $\tilde{S}(k)$, Npt(k)'s arrival noise is filtered through the parameters a and $c_e$, enhancing the DoS change-point.

DoS attacks cause a sharp increase in the CUSUM statistic $\tilde{S}(k)$, followed by a linear increase. Detecting this event is equivalent to change-point detection. Wavelets have been successfully applied to change-point detection in time series (Basseville and Nikiforov, 1993; Antoniadis and Gijbels, 1997; Wang, 1999; Wang, 1995; Lavielle, 1999), and generally outperform traditional statistical methods. By appropriate selection of scale j, the change-point's signal energy is concentrated in a few coefficients. Our approach uses wavelets to identify $\tilde{S}(k)$'s change-point due to a DoS attack. With a vanishing moment of one, the Haar wavelet coefficients $d_{j,l}$ capture abrupt and linear increases of the CUSUM. When $d_{j,l}$ is larger than a threshold value, a DoS attack is detected. In Fig. 1c, the wavelet coefficient (d6,0) of the CUSUM is shown; d6,0 increases significantly at the CUSUM change, which is due to a DoS attack at interval $k = 6 \times 10^4$.
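As an illustration of the averaging-and-differencing construction described above, the sketch below computes level-j Haar detail coefficients. Normalization conventions vary across implementations, so the 1/2 factors here are one common choice rather than the paper's exact scaling; the function name is ours.

```python
def haar_detail(x, j):
    """Level-j Haar detail coefficients d_{j,l} via iterative averaging
    and differencing. Each pass halves the signal: the averages feed the
    next level, and the pairwise differences at level j are returned.
    """
    approx = list(x)
    for _ in range(j):
        if len(approx) < 2:
            raise ValueError("signal too short for decomposition level j")
        nxt, detail = [], []
        for lo, hi in zip(approx[0::2], approx[1::2]):
            nxt.append((lo + hi) / 2.0)     # low-pass: average
            detail.append((hi - lo) / 2.0)  # high-pass: difference
        approx = nxt
    return detail
```

In this scheme, declaring a DoS attack amounts to checking whether any coefficient in haar_detail(s_tilde, j) exceeds the wavelet threshold T.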
5. Packet time-series collection
Test data were collected from multiple sources: (i) network simulators, (ii) DARPA evaluation test sets (MIT), (iii) isolated laboratory testing, and (iv) live data captured from operational networks. None of (i), (ii), or (iii) provides test sets representative of operational networks, for reasons including poor mathematical modeling (Floyd and Paxson, 2001; Willinger and Paxson, 1998), limited topology configurations, short data lengths, and low traffic rates (McHugh, 2000). In particular, the background traffic does not reflect naturally occurring traffic dynamics; higher amounts of burstiness and time-of-day variation are needed (Leland et al., 1993). Accurate modeling of realistic network traffic remains an open question. We present limited results based on synthetic data and focus our evaluation on live captured data.
5.1. ns-2 simulation
Simulated time series were constructed using the ns-2 simulator (VINT Project). Our ns-2 simulations had 75 nodes, 15 of which could be DoS attack victims. The topology contained highly connected interior nodes surrounded by simply connected edge nodes.
Interconnections between interior nodes used high-bandwidth links (100–233 Mbps); low-bandwidth (10–33 Mbps) connections were used between the edge and interior nodes. Zombies were edge nodes; targets were highly connected interior nodes. This topology is qualitatively similar to the Internet (Brooks, 2005). Background traffic was a mix of bursty and constant bit rate (CBR) TCP traffic. The bursty traffic used a Pareto-distributed random variable, a packet size of 2048 bytes, a burst time of 10 ms, and a shape parameter of 1.5. Background CBR traffic was built with 2048-byte packets every 10 ms. DoS attacks generated CBR traffic of 404-byte UDP packets every 0.5 ms. Fig. 1 shows a UDP flood attack produced by 20 zombies targeting a single victim; the time-series time interval is 10 ms. The attack produces an abrupt 300% traffic increase at interval $k = 6 \times 10^4$. More prominent is the lack of burstiness in the ns-2 background traffic, an issue common to synthetic time-series.
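For readers without ns-2, the sketch below generates a loosely analogous synthetic packet-count series in Python. The burst-arrival probability and the per-interval attack volume are illustrative assumptions, not parameters taken from our ns-2 scripts.

```python
import random

def synthetic_counts(n, attack_start=None, attack_pkts=20, p_burst=0.1):
    """Rough analogue of the ns-2 traffic mix: one CBR background packet
    per 10 ms interval, Pareto-distributed (shape 1.5) on/off bursts, and
    an optional constant-rate attack beginning at `attack_start`.
    """
    counts = []
    burst_left = 0
    for k in range(n):
        c = 1  # background CBR packet in this 10 ms interval
        if burst_left > 0:
            c += 1
            burst_left -= 1
        elif random.random() < p_burst:
            burst_left = int(random.paretovariate(1.5))  # burst length
        if attack_start is not None and k >= attack_start:
            c += attack_pkts  # attack CBR traffic
        counts.append(c)
    return counts
```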
5.2. Live network traffic
To address our concerns with simulated data, live TCP data were collected from Penn State University's operational network. Traffic was passively collected at a subnet firewall using tcpdump (TCPDump Group). Test data were collected for two weeks, yielding 119 packet time-series; each sample averaged 8 h in length. Privacy restrictions allowed only the counting of TCP packets and their arrival timestamps. From each recording, a time-series counted the number of TCP packets within consecutive, uniformly sized time intervals of 1 s (Fig. 2a).
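A minimal sketch of the binning step, assuming the tcpdump captures have already been reduced to a sorted list of arrival timestamps (in seconds):

```python
def bin_packet_counts(timestamps, interval=1.0):
    """Build Npt(k): the number of packets in consecutive, uniformly
    sized time intervals (1 s here, matching the live data sets)."""
    if not timestamps:
        return []
    t0 = timestamps[0]
    counts = [0] * (int((timestamps[-1] - t0) / interval) + 1)
    for t in timestamps:
        counts[int((t - t0) / interval)] += 1
    return counts
```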
5.3. DoS attack modeling
Our 'live' time-series have higher traffic rates and burstiness than the ns-2 simulations (Fig. 1a). We did not identify any DoS flood events in the raw data. Like the testing approach in Feinstein et al. (2003), we augmented the operational time-series with a DoS attack modeled as an abrupt traffic surge, or step function, of random length. Additionally, randomly selected sections of other time-series were added as noise during the attack's duration (see Appendix A). Actual DoS attack data would be preferable, and we are continuing our monitoring efforts to collect data samples containing actual network attacks and representative background traffic. Fig. 3 shows a live time-series containing a modeled DoS attack of scale 7 (see Appendix A). Our DoS event is qualitatively similar to the ns-2 (Fig. 1) and DARPA/MIT (Fig. 4; MIT) simulations: the DoS attack is a steep increase in the average packet rate, overlaid on operational traffic of high variance and rate. These traffic characteristics are not present in the synthetic data (ns-2, DARPA/MIT), and are representative of operational Internet traffic (Floyd and Paxson, 2001; Willinger and Paxson, 1998; McHugh, 2000).
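A sketch of the augmentation step is shown below. The exact construction, including the attack-length distribution, is specified in Appendix A (not reproduced here), so the length range and helper names are assumptions; only the roles of `scale` and the overlaid noise slice follow the description above.

```python
import random

def inject_attack(series, scale, noise_series, min_len=100, max_len=1000):
    """Overlay a modeled DoS attack on a live background time-series:
    a step of random length that raises the mean by (scale - 1) times
    the background average, plus a random slice of another live series
    added as noise for the attack's duration (per Appendix A's intent).
    Assumes len(series) and len(noise_series) both exceed max_len.
    """
    out = list(series)
    mean = sum(series) / float(len(series))
    length = random.randint(min_len, max_len)          # assumed range
    start = random.randrange(0, len(series) - length)
    n0 = random.randrange(0, len(noise_series) - length)
    for k in range(length):
        out[start + k] += int((scale - 1) * mean) + noise_series[n0 + k]
    return out, start
```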
6. Detection method implementation
The CUSUM and wavelet analysis algorithms were implemented in C. Sun (Windows) platform support is provided through the PCAP (WinPcap) libraries. ns-2 and tcpdump file formats are accepted; all input time-series used in testing were read from saved files. PERL and MATLAB scripts automated testing, analysis, and data presentation.

Fig. 2 – (a) Live packet time series; (b) CUSUM statistic; (c) wavelet decomposition coefficient d6,0.
7. Testing procedures
Four test data sets of multiple time-series were formed: one from ns-2 simulations and three from live time-series. The ns-2 and live test data sets contained 8 and 238 time-series, respectively. Half of the time-series within each data set contain a single DoS attack; the remaining time-series contain no attacks. Each data set is summarized in Table 1. The column 'scale' represents the multiplicative factor by which the average packet rate increases due to a DoS attack. For the SNS2 data set, the scale was three; the DoS attack by 20 zombies produces traffic (30 packets per second) that is triple the background rate (10 packets per second). In the live data sets, the average traffic rate increase caused by a DoS attack is determined by the parameter scale (Appendix A).
Fig. 3 – Live time-series with a modeled DoS attack: (a) time series; (b) CUSUM statistic; (c) coefficient d10,0.

Fig. 4 – DARPA/MIT synthetic time-series.
Data sets SLIVE4, SLIVE7, and SLIVE10 were constructed using scale values of 4, 7, and 10, respectively. Each time-series s within data set $S_i$, where $i \in$ {ns-2, LIVE4, LIVE7, LIVE10}, was evaluated by the two-stage detection method. The algorithm is adjusted by a set of operating parameters P. The elements of set P are:

- wavelet decomposition level (WDL),
- wavelet coefficient threshold (T),
- CUSUM estimated packet rate memory (e),
- CUSUM local averaging memory (a), and
- CUSUM noise correction factor (ce).
The detection method outputs the number of DoS events detected within the time-series s under test. This is equivalent to the number of times the wavelet decomposition coefficients exceed the wavelet threshold T; the wavelet decomposition level and threshold are defined in set P. DoS analysis of the data sets was performed over multiple sets P. In each testing iteration of time-series s, only one element of P was varied while the others remained constant. The range of each single parameter's variation was determined through experimentation.

For any test data set $S_i$, where $i \in$ {ns-2, LIVE4, LIVE7, LIVE10}, the optimum detection rate is achieved under those conditions P which jointly maximize the true detection percentage $p^{S_i}_{true}(P)$ and minimize the false positive percentage $p^{S_i}_{false}(P)$. The derivation of $p^{S_i}_{true}(P)$ and $p^{S_i}_{false}(P)$ is provided in Appendix B. The set $P = P^*$ under which $p^{S_i}_{true}(P)$ is maximized and $p^{S_i}_{false}(P)$ is minimized is the detector's optimum operating point. The values $p^{S_i}_{true}(P^*)$ and $p^{S_i}_{false}(P^*)$ represent the detector's best overall average performance.
Table 1 – Test data sets

Label   | Source | Number of time-series | Scale | Number of DoS events
SNS2    | ns-2   | 8                     | 3     | 4
SLIVE4  | Live   | 238                   | 4     | 118
SLIVE7  | Live   | 238                   | 7     | 118
SLIVE10 | Live   | 238                   | 10    | 118
8. Testing results
Receiver Operating Characteristic (ROC) curves illustrate our test results: the detection method's false positive percentage is plotted against its true detection percentage for a varying parameter from P. The ROC curves illustrate the performance gain between detection methods, the effects of parameter variations, and the best possible detection efficiency. Each ROC is built from the values of the pairs $(p^{S_i}_{false}(P), p^{S_i}_{true}(P))$. A datapoint represents the likelihood of detecting a true DoS attack or a false positive; the percentages were determined by testing all data set samples under a common parameter set P. The datapoint with minimum distance to the upper-left point (0, 1) represents the best possible detection efficiency.

Each test iteration varied one parameter value from P. At the most efficient value of $(p^{S_i}_{false}(P), p^{S_i}_{true}(P))$, the varying element from P is recorded. This parameter value is the best possible and becomes an element of the set P*. For subsequent parameter variation testing of the same data set, the parameters already fixed in P* are used in P. After all parameters of P have been exercised, the set P* is complete. Set P* gives the detection method's operating parameters, which produce the maximum true detection rate and minimum false positive rate that can be jointly obtained.
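The selection of the optimum operating point can be expressed compactly: among the ROC datapoints generated by the parameter sweep, choose the one nearest the ideal corner (0, 1). The sketch below assumes rates are expressed as fractions in [0, 1]; the function name and tuple layout are ours.

```python
def best_operating_point(roc_points):
    """Return the (params, p_false, p_true) tuple whose ROC datapoint
    lies closest to the ideal upper-left corner (0, 1)."""
    return min(
        roc_points,
        key=lambda pt: pt[1] ** 2 + (pt[2] - 1.0) ** 2,  # squared distance to (0, 1)
    )
```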
8.1. ns-2 simulations
For the ns-2 data set, detection efficiency was ideal (Fig. 5). All DoS attacks within the data set were correctly detected, with zero false positives. This is visually evident as each ROC plot has a datapoint at (0, 1), representing $p^{ns\text{-}2}_{false}(P) = 0\%$ and $p^{ns\text{-}2}_{true}(P) = 100\%$. The set P corresponding to this datapoint is the best possible parameter set P*.
Fig. 5 – Efficiency vs. wavelet threshold for ns-2 simulations. Varying parameters: wavelet threshold (T) and wavelet decomposition level (WDL); static parameters: CUSUM estimated average memory (e) = 0.97, CUSUM local averaging memory (a) = 0.99, CUSUM correction factor (ce) = 0.20.
Table 2 – ns-2 parameter set P*

Parameter                          | Setting
Wavelet decomposition level (WDL)  | {5, 6, 7}
Wavelet coefficient threshold (T)  | [30, 130]
CUSUM estimated average memory (e) | 0.97
CUSUM local averaging memory (a)   | 0.99
CUSUM correction factor (ce)       | 0.20
In Fig. 5, each ROC trace uses a unique setting of the wavelet decomposition level (WDL) over a varying wavelet threshold (T). CUSUM parameter variations (e, a, ce) were not investigated, as no increase over ideal detection performance could be gained. For the ns-2 data set, the optimal parameter set P* is given in Table 2. Detection efficiency under set P* is ideal; we suggest this ideal performance is an artifact of inadequate ns-2 traffic modeling, as discussed in earlier sections.
8.2. Live traffic data sets
Testing results from the live data sets are more representative of the detection method's field performance. Affected by realistic background traffic, the detector is more challenged: true and false positive detection percentages are less than ideal. Discussion of the distinctly high false positive rates is deferred until Section 9.
8.2.1. Statistical (CUSUM) vs. wavelet thresholding
Wavelet analysis of the CUSUM statistic is proposed to obtain higher detection performance. To support this claim, the live data sets were analyzed under both CUSUM and wavelet thresholding.
Fig. 6 – Efficiency vs. CUSUM/wavelet thresholding. Varying parameters: wavelet threshold (T), CUSUM threshold; static parameters: wavelet decomposition level (WDL) = 6, CUSUM estimated average memory (e) = 0.988, CUSUM local averaging memory (a) = 0.97, CUSUM correction factor (ce) = 0.20.
For CUSUM thresholding, DoS detection is equivalent to $\tilde{S}(k)$ exceeding a threshold $T_{CUSUM}$, which is the basis for detection in Blazek et al. (2001). Wavelet thresholding evaluates the coefficients $d_{j,l}(k)$, obtained from wavelet analysis of $\tilde{S}(k)$, against the wavelet threshold T. Fig. 6 is a ROC with two traces: the lower trace provides detection likelihoods under CUSUM threshold variations, while the upper trace denotes wavelet threshold variations. Datapoints on the curves are measures of average detection efficiency. An increase in either threshold T or $T_{CUSUM}$ can lower both the true detection and false detection percentages. The distance between the two curves is of interest. The data set was SLIVE4. Datapoints for the wavelet thresholding trace lie closer to the ideal point (0, 1): higher detection efficiency is achieved from the additional wavelet processing of the CUSUM. Extracted from the ROC of Fig. 6, Table 3 shows the maximum average detection rates for each analysis method. The ratio of true detection to false positive percentages for the wavelet analysis is 56% higher than for CUSUM processing alone. Wavelet analysis also has equal or greater average true detection rates for a given false positive rate. The same behavior was seen for data sets SLIVE7 and SLIVE10 (not shown).
Fig. 7 – Efficiency vs. wavelet decomposition. Varying parameters: wavelet threshold (T) and wavelet decomposition level (WDL); static parameters: CUSUM estimated average memory (e) = 0.988, CUSUM local averaging memory (a) = 0.97, CUSUM correction factor (ce) = 0.20.

8.2.2. Variation of the wavelet decomposition level (WDL)
Wavelet analysis of the CUSUM statistic provides better detection efficiency than CUSUM analysis alone. Wavelet analysis is also capable of enhanced change-point detection through its scaling parameter, the wavelet decomposition level (WDL); WDL equals the value of j in Eq. (9). Fig. 7 shows the detection analysis of the SLIVE4 data set over variations of the wavelet decomposition level (WDL) and threshold (T). Each point represents a different wavelet threshold value, whereas each curve represents a different wavelet decomposition level. The 6th level of decomposition was the baseline, since it was determined earlier to be better than CUSUM thresholding. Increases in WDL produce datapoints closer to (0, 1), and thus better detection ratios. A maximum is reached at WDL = 10; further WDL increases have an adverse effect, producing datapoints which lie below the 6th level. Higher WDLs perform poorly in low signal-to-noise environments (Wang, 1995). Table 4 summarizes the true detection and false positive rates for the 6th, 10th, and 12th WDLs.
8.2.3. Detection efficiency uncertainty
The uncertainty of the average detection rates is defined by $e^{S_i}_{true}(P)$ and $e^{S_i}_{false}(P)$ (see Appendix B). This is shown by the error bars superimposed on the ROC datapoints of Fig. 8. For low threshold values, the error bars show about a 10% spread around the datapoint. The horizontal error bars, representing false positive uncertainty, increase as the threshold is lowered: at lower values of the wavelet threshold, noise is more likely to cause a false positive, leading to more spread in the detection uncertainty.
8.2.4. Variation of DoS attack strength
The detection method was evaluated against multiple DoS scales. DoS scale is a measure of the traffic increase incurred during an attack: the higher the scale, the stronger the attack. Data sets SLIVE4, SLIVE7, and SLIVE10 were constructed with DoS scale factors of 4, 7, and 10, respectively (Appendix A). Fig. 9 shows detection results for WDL = 10 across the three live test data sets (SLIVE4, SLIVE7, SLIVE10). As expected, when the DoS scale is increased, the ROC traces approach the upper-left corner, indicating better detection efficiency. The kink in the trace of SLIVE10 needs further investigation.
Table 3 – Wavelet vs. CUSUM best detection rate

Analysis                     | True detection rate (%) | False positive rate (%) | Detection ratio
CUSUM only                   | 15                      | 18                      | 0.83
CUSUM with wavelet           | 40                      | 30                      | 1.3
Detection ratio increase (%) |                         |                         | 56
Table 4 – WDL variation on detection percentages

WDL  | Threshold | True detection rate (%) | False positive rate (%) | Ratio
6th  | 43,000    | 39                      | 30                      | 1.3
10th | 30,000    | 47                      | 21                      | 2.2
12th | 6000      | 30                      | 39                      | 0.7
Fig. 8 – Efficiency vs. wavelet thresholding. Varying parameter: wavelet threshold (T); static parameters: CUSUM estimated average memory (e) = 0.988, CUSUM local averaging memory (a) = 0.97, CUSUM correction factor (ce) = 0.20.

Fig. 9 – Efficiency vs. DoS scaling. Varying parameter: wavelet threshold (T); static parameters: wavelet decomposition level (WDL) = 10, CUSUM estimated average memory (e) = 0.988, CUSUM local averaging memory (a) = 0.97, CUSUM correction factor (ce) = 0.20.
Table 5 shows the maximum detection efficiency per DoS scale variation.
8.2.5. Variation of CUSUM parameters

The first stage of processing used the CUSUM statistic of Eq. (8) with three parameters: the estimated average memory (e), the local averaging memory (a), and the noise correction factor (ce). Each of these parameters was varied to determine its effect on detection efficiency. The SLIVE7 data set was used, along with the best possible wavelet parameters determined in the previous sections: WDL = 10, wavelet threshold T = 30,000.

(a) Estimated average memory (e): parameter e determines the amount of past history used in estimating the average packet rate m(k), which is a dominant term in the CUSUM statistic $\tilde{S}(k)$. The effect of varying e on detection efficiency is shown in Fig. 10. The best setting for parameter e is 0.98811.

(b) Local averaging memory (a): parameter a determines the amount of past history used to filter arrival noise within the input packet time-series Npt(k). The effect of varying a on detection efficiency is shown in Fig. 11. Below 0.11, the true detection and false positive rates linearly approach 0; little variation in the true detection rate is seen for a values between 0.22 and 1. The best value for a is 0.22.

(c) Correction factor (ce): parameter ce reduces the variance of the input traffic on the CUSUM statistic. Its effect on detection efficiency is shown in Fig. 12. The optimum setting for ce is about 0.13. The removal of noise and variance from the arrival process by parameter ce has a positive effect on detection percentages.

All parameters for the statistical (CUSUM) and wavelet analysis have now been evaluated. Table 6 states the resulting parameter set P* acquired from testing our live data sets.
8.3. Detection delay
Another useful performance metric is detection delay. For a time-series in which both the CUSUM and wavelet thresholding indicated a DoS attack, a delta delay measurement was recorded; detection delay is only valid when both the wavelet and CUSUM thresholds were crossed. Let $k_1$ and $k_2$ indicate the time intervals at which the wavelet and CUSUM processing, respectively, first declared a DoS attack:

$$k_1 = \min_{z \in \mathbb{Z}^{+}} \arg\left[ d_{j,0}(z) \geq T \right] \tag{11}$$

$$k_2 = \min_{z \in \mathbb{Z}^{+}} \arg\left[ \tilde{S}(z) \geq T_{CUSUM} \right] \tag{12}$$

where j is the wavelet decomposition level. The delta delay measurement is defined as $\delta = k_2 - k_1$. A positive $\delta$ indicates how much earlier the wavelet analysis detected the DoS event relative to CUSUM-only processing.
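A direct transcription of Eqs. (11) and (12), with the guard that δ is reported only when both thresholds are crossed; the function name is ours.

```python
def delta_delay(d_j0, s_tilde, T, T_cusum):
    """Compute delta = k2 - k1: the lead of wavelet detection (first
    d_{j,0} >= T) over CUSUM detection (first S~(k) >= T_CUSUM)."""
    k1 = next((k for k, d in enumerate(d_j0) if d >= T), None)
    k2 = next((k for k, s in enumerate(s_tilde) if s >= T_cusum), None)
    if k1 is None or k2 is None:
        return None  # valid only when both thresholds were crossed
    return k2 - k1
```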
Table 5 – DoS scale performance

DoS scale | Data set | True detection rate (%) | False positive rate (%) | Ratio
4         | SLIVE4   | 46                      | 21                      | 2.1
7         | SLIVE7   | 68                      | 24                      | 2.8
10        | SLIVE10  | 78                      | 25                      | 3.1
Fig. 10 – Efficiency vs. e variation. Varying parameter: CUSUM estimated average memory (e); static parameters: wavelet decomposition level (WDL) = 10, wavelet threshold (T) = 30,000, CUSUM local averaging memory (a) = 0.97, CUSUM correction factor (ce) = 0.20.

Fig. 11 – Efficiency vs. a variation. Varying parameter: CUSUM local averaging memory (a); static parameters: wavelet decomposition level (WDL) = 10, wavelet threshold (T) = 30,000, DoS scale = 7, CUSUM estimated average memory (e) = 0.988, CUSUM correction factor (ce) = 0.20.
The set of δ measurements was captured from the SLIVE4 data set. The number of samples was limited (