Storing Digital Binary Data in Cellular DNA: The New Paradigm
Dr. Rocky Termanini
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2020 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-323-85222-7

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner
Acquisitions Editor: Chris Katsaropoulos
Editorial Project Manager: Ana Claudia A. Garcia
Production Project Manager: Sruthi Satheesh
Cover Designer: Victoria Pearson
Typeset by TNQ Technologies
I dedicate this book to the young generation of the Edisons, Turings, Wieners, Hawkinses, Venters, Watsons, Cricks, Churches, Franklins, Doudnas, Charpentiers, Zhangs, Ehrlichs, Zielinskis, and the rest of the genetics pioneers, such as Illumina, Twist Bioscience, Sandia National Lab, and the University of Washington with Microsoft Research (Molecular Information Systems Lab), who are still in the lab working to enhance our living.
About the author

Dr. Rocky Termanini, CEO of MERIT CyberSecurity Group, is a subject matter expert in IT security, artificial intelligence, nanotechnology, machine and deep learning, and DNA digital storage. He is a member of the San Francisco Electronic Cyber Crime Task Force. He brings 46 years of cross-industry experience at national and international levels. He is the designer of the "Cognitive Early-Warning Predictive System" and "The Smart Vaccine™," which replicates the human immune system to protect the critical infrastructures against future cyberwars. Dr. Termanini spent 5 years in the Middle East working as a security consultant in Saudi Arabia, Bahrain, and the UAE. Professor Termanini's teaching experience spans over 30 years. He has taught Information Systems courses at Connecticut State University, Quinnipiac University, University of Bahrain, University College of Bahrain, and Abu Dhabi University and has lectured at Zayed University in Dubai.
Acknowledgments

DNA is a magic word, wrapped in mystery. DNA data storage is a field in its infancy, rapidly growing into the new paradigm that will shift magnetic storage into molecular storage. I have talked and listened to many people about this subject, absorbed a lot of new ideas, and learned about the dark corners of this technology, which is just as exciting as the new space discoveries of the black hole. I salute these people, and I acknowledge their contributions to the world of DNA.

Ms. Lina Termanini, Senior Manager, Global Methods, Ernst & Young, San Francisco, CA
Dr. Zafer Termanini, Orthopedic Surgeon, Saint Lucie, Florida
Dr. Sami Termanini, General Medicine, Dublin, Ireland
Ms. Mia Termanini Williams, Business Consultant, Saint Lucie, Florida
Radwan Termanini, Senior Economist, Doha, Qatar
Samir Termanini, Attorney at Law, Computer Science Expert, Newark, New Jersey
Dr. Eesa Bastaki, President of University of Dubai, Dubai, UAE
Dr. Bushra Al Blooshi, Director of Research and Innovation, Dubai Electronic Security Center, Dubai, UAE
Dr. Charles J. Bentz, Fanno Creek Clinic, Portland, OR
Dr. George M. Church, Professor of Genetics, DNA storage inventor, Harvard Medical School, Boston, MA
Dr. Feng Zhang, Biochemist, CRISPR inventor, MIT, Boston, MA
Dr. Abdul Rahim Sabouni, President & CEO of the Emirates College of Technology (ECT), Abu Dhabi, UAE
Dr. Hussain Al-Ahmad, Dean of the College of Engineering at University of Dubai
Dr. Sameera Almulla, Board Member of the Emirates Science Club, Dubai, UAE
Dr. Howard Zeiger, Internal Medicine, John Muir Medical Group, Alamo, CA
Dr. Matthew DeVane, Physician, Cardiovascular Consultants Medical Group, San Ramon, CA
Alison Ryan, PA-C
Dr. Robert Robles, Diablo Valley Oncology and Hematology, California Cancer and Research Institute, Pleasant Hill, CA
Dr. Siripong Malasri, Dean of Engineering, Christian Brothers University, Memphis, TN
Dr. Judson Brandeis, Brandeis MD Clinique, San Ramon, CA
Colonel Khaled Nasser Alrazooki, Ministry of Interior, Dubai, UAE
Ms. Akila Kesavan, Executive Director, Ernst & Young, San Francisco, CA
Dr. Yigal Arens, Director, USC/Information Sciences Institute, Los Angeles, CA
Ms. Kristina Nyzell, CEO, Disruptive Play Consulting, Malmo, Sweden
Meshal Bin Hussain, CIO, Ministry of Finance, Abu Dhabi, UAE
John & Danielle Cosgrove, Cosgrove Computer Systems, El Segundo, CA
Ms. Amna Almadhoob, Senior Security Researcher, AMX Middle East, Bahrain

I also would like to thank the rest of my visionary and creative friends for their gracious assistance. Finally, I am indebted to and graciously thank my family and all the people who own a part of this book. My special thanks and gratitude go to the Academic Press staff (Ana Claudia Abad Garcia, Sruthi Satheesh, Chris Katsaropoulos, and Swapna Praveen) who gave all of us the chance to enjoy reading the book.

Dr. Rocky Termanini, CA 94595
CRISPR pioneers (from left to right): George Church, Jennifer Doudna, Feng Zhang, and Emmanuelle Charpentier. My highest esteem to these four ambassadors of DNA for their contribution to humanity and for giving hope of a healthy life to patients once considered doomed. They have advanced biomedicine by 100 years.
Prologue

The most remarkable property of the universe is that it has spawned creatures able to ask questions.
- Stephen Hawking, The Illustrated Theory of Everything: The Origin and Fate of the Universe

The first ultra-intelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.
- Irving J. Good, 1963

Microsoft CEO Satya Nadella says underwater data centers will play a major role in expanding the firm's global cloud computing platform.
DNA is much more than the sum of two words. Deoxyribonucleic acid is the biological schematic that dictates the shape of our cheekbones, and whether we sneeze and wheeze every spring when pollen pops. It is the miracle that reminds all of us of when Jesus spoke to the man who could not walk. "I tell you," He said, "get up. Take your mat and go home." The man got up and took his mat. Then he walked away while everyone watched. All the people were amazed. They praised God and said, "We have never seen anything like this!"
- Mark 2:6-12

It all started when I read Dr. George Church's book Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, which sparked a real obsession in me to learn more about how DNA works and how artificial intelligence (AI) and nanotechnology have pushed the envelope. The book is truly the most compelling bit of prophecy since the Old Testament first came out in hardback. I then read about the magic of CRISPR (clustered regularly interspaced short palindromic repeats) and how it ignited a revolution. Dr. Feng Zhang of MIT is a master of CRISPR, the formidable gene-editing tool. In 2012, Dr. Jennifer Doudna and Dr. Emmanuelle Charpentier were the first to propose that CRISPR/Cas9 (enzymes from bacteria that control microbial immunity) could be used for programmable editing of genomes, now considered one of the most significant discoveries in the history of biology. It is mind-blowing to realize that in just 5 years, CRISPR could be used to eliminate mutated genes in a patient with muscular dystrophy and insert healthy genes into his DNA. Pretty soon, his muscles will be rejuvenated, and he will be able to walk! We can softly say that CRISPR is the hand of Jesus, resurrected.

As I got deeper into my DNA research, I came upon an amazing, mind-stretching revelation: DNA nucleotides (the building blocks A, C, T, G) can be used not only to make body protein but also to encode binary data from our business world.
This is going to push aside the world of magnetic and silicon storage. The origin of this idea goes back to 1964, when Mikhail Neiman, a Soviet physicist, published his work in the journal Radiotechnika. Neiman explored the possibility of recording, storing, and retrieving information on DNA molecules.
The physicist explained that he got the idea from an interview with Norbert Wiener, the American cyberneticist, mathematician, and philosopher, published in 1964.

Innovation is like a wave: you cannot stop it, but you can ride it. Here is a story about innovation. The Polynesians were master navigators who traveled with neither compass nor sextant. They learned to read the pattern formed by waves. They observed that when waves hit an island, some are reflected back while others are deflected but continue on in a modified form. Each navigator used the motion of the canoe to feel the waves across the ocean.

My research led me to the journal Science, which published a research paper by Dr. George Church and colleagues at Harvard University. In the paper, the authors described an experiment in which DNA was encoded with digital information, including an HTML draft of the 53,400-word text of Dr. Church's book Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, using a four-letter code based on the four DNA nucleotides (A, T, C, and G). The achieved density was about 5.5 petabits (a petabit is 10^15 bits) per 1 mm³ of DNA! Dr. Church used a simple translation code in which binary bits were mapped one-to-one to DNA bases. The results showed that, in addition to its other functions, DNA can also serve as a storage medium alongside hard drives and magnetic tapes.

The March 2018 issue of MIT Technology Review published a detailed article discussing the work of Dr. Andrew Phillips, head of the Biological Computation Group at Microsoft, who is currently developing methods and software for understanding and programming information processing in biological systems. The article described a new disruptive technology in which DNA could be the new habitat of binary data. My focus then shifted to how we can fuse the Digital Immunity Ecosystem (DIE), a replication of the human immune system, with DNA binary data storage. I wrote to Dr.
Bastaki, president of the University of Dubai, mentioned the exciting news about DNA as the new paradigm of data storage, and suggested we write a paper on how DNA binary storage could serve as the storage facility of the future. He replied: "Truly, this is an excellent work." As a matter of fact, DNA data storage (DDS) will be a critical component of the DIE, the futuristic molecular security system that protects the critical systems of the city. He encouraged me to write a paper highlighting the benefits of DNA storage technology, which will replace magnetic and silicon storage options. DNA will have all the advantages that come with storage capacity, integrity, and authentication. This is a marvelous project of innovation and creativity. Innovation rocks: it is a process, not an accident.

So, I decided to write this book and highlight the storage issues of our present digital universe, which are manifested in exponential, out-of-control growth. Organizations worldwide, large and small, whose IT infrastructures transport, store, secure, and replicate these bits, have little choice but to employ ever more sophisticated techniques for information management, security, search, and storage.

The journey of data storage began with bones, rocks, and paper. It then moved on to punched cards, magnetic tapes, gramophone records, floppies, and so forth. Later, with the development of the technology, optical discs (CDs, DVDs, Blu-ray discs) and flash drives came into operation. All of these are subject to decay. As nonbiodegradable materials, they pollute the environment, and they consume energy and release large amounts of heat in operation. The world keeps crunching data, and dancing to its music, while our digital universe is slipping into the sunset. The storage technology vendors are looking vertically, with myopic vision, at diversifying and relabeling their products.
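Dr. Church's one-to-one translation code mentioned above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the published implementation: in the 2012 experiment, each bit mapped to a single base, with 0 written as A or C and 1 as G or T, so an encoder can choose the synonym that avoids long runs of the same base.

```python
# Sketch of a Church-style one-bit-per-base encoding (illustrative only).
# 0 may be written as A or C, and 1 as G or T; picking between the two
# synonyms lets the encoder break up homopolymer runs.
ZERO, ONE = "AC", "GT"

def bits_to_dna(bits: str) -> str:
    """Encode a bit string as DNA, alternating synonyms to avoid repeats."""
    out = []
    for bit in bits:
        choices = ZERO if bit == "0" else ONE
        # Prefer the synonym that differs from the previous base.
        base = choices[0] if (not out or out[-1] != choices[0]) else choices[1]
        out.append(base)
    return "".join(out)

def dna_to_bits(dna: str) -> str:
    """Decode: A or C -> 0, G or T -> 1."""
    return "".join("0" if base in ZERO else "1" for base in dna)

encoded = bits_to_dna("0011010")   # -> "ACGTAGA", no two equal bases adjacent
assert dna_to_bits(encoded) == "0011010"
```

Because decoding only asks which synonym pair a base belongs to, the encoder is free to optimize the physical sequence without losing information.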
In the past year, signs of real change have become more visible, and recent news has brought to light alarming issues for every silicon user. The shortage of silicon supply is driving higher prices, global demand is expected to continue, and many big elite storage companies (Pure Storage, Veritas, Western Digital, Seagate, NetApp, Hewlett Packard Enterprise,
Hitachi Data Systems) and major suppliers remain concerned about their supply chain in the short term and the bottom-line impact in the future. Today, data have become the new oil, meaning that we no longer regard the information we store as merely a cost of doing business, but a valuable asset and a potential source of competitive advantage. It has become the fuel that powers advanced technologies such as machine learning (ML). To meet the demand of big data storage, companies such as Microsoft, IBM, Facebook, and Apple are looking beyond silicon for solutions. The next-generation data storage market is expected to be valued at $144.76 billion by 2024.
The big escape to the cloud

We are now in the middle of a storage war, creating confusion and havoc among leaders of the business world. The big computing companies are luring their strategic customers to migrate to their clouds. Leaders are actively refactoring legacy applications and developing cloud-native applications from the ground up. The sales cliché is: "Start with a cloud-native approach and build, modernize, and migrate without being locked in. Unlock the value of your data in new ways and accelerate your journey to AI."

Basically, cloud computing is a model or infrastructure that provisions resources dynamically and makes them available as services over the Internet. Cloud services include infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). There are three major types of cloud computing models: private clouds, public clouds, and hybrid clouds. The key criteria for evaluation include the underlying infrastructure, redundancies, connectivity, uptime, service-level agreement (SLA), and turnaround services. These six critical "performance" areas must be thoroughly examined by organizations before they fully jump into the cloud. The fundamental benefits of the "as a service" model are well known: a shift from capital to operational expenditure (capex to opex), often leading to lower total cost of ownership (TCO); access for businesses of all sizes to up-to-date technology; maintenance by service providers that can leverage economies of scale; scalability according to business requirements; fast implementation times for new applications and business processes; and the freeing up of staff and resources for other projects and priorities.

In 1963, Sam Wyly founded University Computing Company (UCC) as a data processing service bureau on the campus of Southern Methodist University in Dallas, Texas.
It was the first pseudo cloud (prior to the Internet) and offered two software-as-a-service options: a tape management system and a job scheduler. The cloud gained popularity as companies gained a better understanding of its services and usefulness. In 2006, Amazon launched Amazon Web Services, which offers online services to other websites and clients. The same year, Google joined the cloud with a spreadsheet and word processor. Then IBM, Microsoft, and Oracle jumped on the cloud bandwagon.
Cloud hardware

From the 1960s through the 1990s, up to the commercialization of the Internet, service bureaus relied on water-cooled, time-sharing legacy systems. Telecommunications companies primarily offered only dedicated, point-to-point data circuits to their users. Beginning in the 1990s, however, they began expanding their offerings to include virtual private network (VPN) services.
Then the greed of embracing the dot-com market spread, climaxing in June 2000 with the burst of the bubble: half of the dot-com companies vanished. Layoffs of programmers resulted in a glut in the job market. Office equipment liquidation turned sour. University enrollment in computer-related degrees dropped noticeably. Anecdotes of unemployed programmers going back to school to become accountants or lawyers were common. New companies, such as Amazon, Google, IBM, and Microsoft, started to deploy the new cloud model, "Everything as a Service" (XaaS), supported by massive hyperscale data centers.

Since the invention of the integrated circuit approximately 60 years ago, computer chip manufacturers have been able to pack more transistors onto a single piece of silicon every year. Moore's law worked fine for 40 years, with the number of transistors doubling every 24 months. The first chip had 10 transistors (at a 10-micron feature size); today the most complex silicon chips can hold a billion times more transistors (at 12 nanometers), and the cost of a chip fabrication plant has grown from millions to billions of dollars. As the limits of packing tightened, chip design, cost, power density, and clock speed all suffered. Tech companies threw in the towel on further shrinking line widths because of the diminishing returns. This also means that the future direction of innovation on silicon is no longer predictable. It is worth remembering that human brains have had 100 billion neurons for at least the last 35,000 years, yet we have learned to do a lot more with the same compute power. The same will hold true with semiconductors: we are going to figure out radically new ways to use those 10 billion transistors.
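The transistor figures above are consistent with simple doubling arithmetic: thirty 24-month doublings over roughly 60 years give a factor of 2^30, about a billion. A quick back-of-envelope check in Python, using the chapter's round numbers rather than exact industry data:

```python
# Back-of-envelope check of the Moore's-law growth cited in the text.
first_chip_transistors = 10      # round figure used in the text
years = 60                       # roughly since the integrated circuit
doubling_period_years = 2        # "doubling every 24 months"

doublings = years // doubling_period_years   # 30 doublings
growth_factor = 2 ** doublings               # 1,073,741,824: about a billion

print(f"growth factor: {growth_factor:,}")
print(f"transistors today: {first_chip_transistors * growth_factor:,}")
```

Starting from 10 transistors, a billion-fold growth lands at roughly 10 billion transistors per chip, matching the "10 billion transistors" figure in the paragraph above.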
Why is silicon important?

Being aware of these painful constraints, the cloud data centers changed direction and decided to move into the all-flash domain. An all-flash array (AFA) is a storage infrastructure that contains only flash memory drives instead of spinning-disk drives. All-flash storage is also referred to as a solid-state array (SSA). AFAs and SSAs offer speed, high performance, and agility for your business. Although modern data centers are looking to AFAs as a solution to performance and capacity demands, not every AFA is created equal. It is important to understand the difference between purpose-built arrays and retrofit arrays. Retrofits attempt to combine all-flash with 20-year-old disk-based architectures, preventing customers from getting the best return on investment and exposing shortcomings in performance, reliability, and simplicity.
The silicon addiction

We love our electronics, and most of us buy a new phone, computer, or laptop every year or two. With each purchase, we expect a faster, more intelligent device. The microchips inside our electronics are currently made of silicon, an abundant material found in sand. The same silicon chip is the prime component in all our computing devices, from our laptops all the way to the AFAs and SSAs, hard drive controllers, and memory and processing units that make up our digital universe. Hyperscale data centers, both conventional and cloud-centric, will realize sometime soon that silicon-based chips will no longer be able to provide devices with the extra speed and functionality that buyers demand. It will be a frightening, "Fourth of July" shock to the business world to learn that DNA, the genetic material inside every human cell, is a leading contender to fill silicon's shoes. All the public, private, hyperscale, Kubernetes, and container cloud providers face a Himalayan storage problem that looms on the horizon of 2022.
One of the remarkable ironies of digital technology is that every step forward creates new challenges for storing and managing data. In the analog world, a piece of paper or a photograph never becomes obsolete, but it deteriorates and eventually disintegrates. In the digital world, bits and bytes theoretically last forever, but the underlying media (floppy disks, tapes, and drive formats, as well as the codecs used to play audio and video files) become obsolete, usually within a few decades. Once the machine or medium is outdated, it is difficult, if not impossible, to access, retrieve, or view the file. Yaniv Erlich, assistant professor of computer science at Columbia University and a core member of the New York Genome Center, observes: "Digital obsolescence is a very real problem. There is a constant need to migrate to new technologies that do not always support the old technologies."
DNA: the holy grail

We can all respectfully consider the wheel the most important invention of humankind; while most other inventions have been derived from nature itself, the wheel is 100% a product of human imagination. Now here is a natural invention that will change the computing world. It is hard to imagine that 10 trillion DNA molecules could fit into the size of a marble. The other bombshell is that all these molecules can process data simultaneously; in other words, we can perform 10 trillion calculations at once in a tiny space! Just imagine: in 0.04 ounce (1 g) of artificial DNA, we can store the data of some 3,000,000 CDs. In an article in the August 2016 issue of Science Magazine, Katherine S. Pollard said, "A mere milligram of the molecule could encode the complete text of every book in the Library of Congress and have plenty of room to spare." In 2014, researchers published a study in the journal Supercomputing Frontiers and Innovations estimating the storage capacity of the Internet at 10^24 bytes, which could fit into a glass cookie jar of DNA. If the sky is the limit, DNA comes close to it! https://www.the-scientist.com/magazine/issue/human-evolution-30-8.

DNA data storage and computing are true disruptive technologies that stand ahead of all technical inventions combined. This book is moderately technical; I followed the style of Ray Kurzweil's great book, The Singularity Is Near. This book is easy to read, but it covers a new paradigm that will give informatics a new dimension. I imagine that in a couple of decades, the business world will have a universe of Internets, and computing will go down to the molecular level, driven by smart AI nanobots that will boost our intelligence and conquer many chronic diseases. Hopefully, man and machine will coexist in harmony.
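The capacity claims above can be sanity-checked with a theoretical back-of-envelope calculation: at 2 bits per nucleotide and roughly 330 g/mol per single-stranded base, one gram of DNA could in principle hold hundreds of exabytes. The constants below are textbook round numbers, and the result is a theoretical ceiling; practical figures such as "3,000,000 CDs per gram" sit far below it because of redundancy, error correction, and synthesis constraints.

```python
# Theoretical upper bound on DNA storage density (rough, idealized numbers).
AVOGADRO = 6.022e23          # molecules per mole
GRAMS_PER_MOL_BASE = 330.0   # approximate molar mass of one ssDNA nucleotide
BITS_PER_BASE = 2            # A/C/G/T encodes 2 bits per base

bases_per_gram = AVOGADRO / GRAMS_PER_MOL_BASE        # ~1.8e21 nucleotides
bytes_per_gram = bases_per_gram * BITS_PER_BASE / 8   # ~4.6e20 bytes
exabytes_per_gram = bytes_per_gram / 1e18             # ~456 EB per gram

cds_per_gram = bytes_per_gram / (700 * 1024**2)       # assuming 700 MB per CD
print(f"~{exabytes_per_gram:.0f} EB/gram, or ~{cds_per_gram:.2e} CDs")
```

Even with several orders of magnitude lost to indexing, redundancy, and physical constraints, the raw ceiling explains why a cookie jar of DNA can plausibly be compared to the whole Internet.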
During the writing of this book, I modestly admit that this work is the collective effort of many meetings with physicians, geneticists, computer engineers, academicians, biologists, law enforcement agents, and devout religious and agnostic partisans. The algebraic sum of all this was enigmatic yet enlightening. I start out by talking briefly about the human immune system, an incredible autonomic ensemble of smart components and arteries that interoperate with perfect orchestration. We will use the human immune system as the reference architecture model for our digital immunity. The Digital Immunity Ecosystem (DIE) is actually a replication of the human immune system. It is built with a futuristic and disruptive architecture that represents the next generation of cyberdefense, not just security, for smart cities. This book is the intersection of AI, nanotechnology, and ML. I describe the expansion of our digital universe with its gluttony of data storage. I introduce DNA synthesis and sequencing systems to capture and store binary data. I present the model of "fusing" digi-informatics with DNA informatics to create a holistic smart ecosystem. I cover in detail DNA hacking and
cryptography. In every chapter, I include testimonies and references to support my arguments. The systems combined into one homogeneous platform rely heavily on DNA computing algorithms, molecular AI, and nanobots energized with ML for predictive analytics. In the end, I cover, with interesting scientific discussions, three important topics that are pathologically tied to DNA: crime, time, and religion. I am confident that this book will be more interesting to computing people than to medical specialists.

A disruptive technology such as DNA data storage (DDS) is one that displaces an established technology and shakes up the industry, like a groundbreaking product that creates a completely new industry. DDS will start with baby steps and will eventually make sense to the residents of the digital universe. I give credit to and salute all the pioneers who developed and connected the dots of this marvelous technology. I hope the leaders of the computing major leagues start investing more time, money, and effort in DDS instead of building more hyperscale data centers.

Dr. Rocky Termanini, Walnut Creek, CA 94595
CHAPTER 1

Discovery of the book of life: DNA
When looking for a needle in a haystack, the optimist wears gloves.
- The Little Book of Things to Keep in Mind

DNA is like a computer program but far, far more advanced than any software ever created.
- Bill Gates, The Road Ahead

If at first an idea does not sound absurd, then there is no hope for it.
- Albert Einstein

DNA is not predestined. Genes are reprogrammable.
- Science Magazine (March 21, 2019)

Today we are at a point in science and technology where we humans can reduplicate and then improve what nature has already accomplished. We too can turn the inorganic into the organic. We too can read and interpret genomes, as well as modify them. And we too can create genetic diversity, adding to the considerable sum of it that nature has already produced.
- Dr. George Church, from Regenesis

The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, the carbon in our apple pies were made in the interiors of collapsing stars. We are made of starstuff.
- Carl Sagan

Hmmm... If the DNA of one human cell is stretched out, it would be almost 6 feet long and contain over three billion base pairs. How does all this fit into the nucleus of one cell?
- Rosalind Franklin: DNA from the Beginning
Initial thoughts

Imagine a world where you could go outside, take a leaf from a tree, put it through your personal DNA sequencer, and get data such as music, videos, or computer programs out of it. Well, this is all possible now. It has not been done on a large scale because creating DNA strands is still quite expensive, but it is possible.

Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00001-X
Copyright © 2020 Elsevier Inc. All rights reserved.
DNA: the code of life

No car can cross the African Sahara without spare tires and ample fuel. The same analogy applies to DNA data storage: no organization, small or large, can successfully compete without deep data lakes and scalable data warehouses to accommodate machine learning and predictive analytics. The idea of storing digital data in a living DNA cell is truly revolutionary and mind-blowing. It is the best option we have, since conventional storage is going to hit the wall soon. We can encode images and movies into synthetic DNA, which acts as a molecular recorder and collects data over time. Imagine riding a time machine 400 years into the future and looking back at movies and images from 2018. It would be refreshingly electrifying. If DNA storage had existed some 3000 years ago, we would have been able to witness how Moses crossed the Red Sea, and the horrific crucifixion of Jesus. Both events would leave an unforgettable, traumatic impression. Converting binary information to DNA code is a great historical Rosetta Stone that will carry our digital universe into the future. This is an amazing, mind-stretching dream. DNA even has a special day, called DNA Day!

So, what is so hypnotic about DNA? It is not an exaggeration to say that DNA holds discrete information about our birth and death. Scientists are working every single day to learn more about mutations and the mechanisms that could confirm or refute DNA's mysteries and provide access to important DNA data. Modern science and medicine have taught us that genetics dictates every portion of our lives in the sense of health and what we can expect in wellness or disease. It seems reasonable to say that if medicine cannot identify the cause and cure of an illness, then genetics will have an answer for it. Genetics' original theory was reasonable because people age on a daily basis, and no method for stopping the aging clock had been found.
It seemed that each cell was permanently stamped by time and could never be reverted or made younger; aging, and every other factor related to the body, was a predestined program unfolding toward its unknown final stop as time passed. Dr. Ian Wilmut was the geneticist who changed that line of thinking. Dr. Wilmut's research finally proved that the human cell, the basis of all life, was not predestined to live or die by the secret code of the genes. Cells were not permanently stamped in time. In fact, this new research proved that all the genetic information encoded as your inheritance is reprogrammable and can be enhanced to extend longevity and eliminate the human misery of disease. Fig. 1.1 illustrates the magic of our genetic code.

Deoxyribonucleic acid, known as DNA, is a bewildering engineering marvel. It is a mechanism that has been at work on the earth for billions of years: storing information in the form of DNA strands. This fundamental process is what has allowed all living species, plants and animals alike, to live on from generation to generation. We would not be here on the earth without this magic molecule. The information in your genes is presently programmable; this is no longer a dream. Research has now demonstrated that your destiny is not predefined: it is set by the choices you make every day. Even more striking than the fact that your genes are programmable is the fact that the way the information in your genes is interpreted can change on a daily basis.

Discovery is an exquisite drive to explore and learn about the unknown. It is true that Archimedes' "Eureka" was also a very emotional shout. Scientific discoveries may seem like sudden breakthroughs, but new findings do not come out of nowhere. Each breakthrough is made possible by the work that came before it. Some scientific discoveries are a bit like putting together the pieces of a puzzle.
Many different researchers discover important bits of evidence - pieces of the puzzle - and the sudden breakthrough arises when one person sees how the puzzle pieces logically fit together. In the case of
FIGURE 1.1 Normal text is translated into ASCII binary code, and each 2-bit combination (00, 01, 10, 11) is then converted into a DNA base.
DNA, new findings and technological advances have made so many new puzzle pieces available that the odds of someone putting them together seem quite high. Making this final leap often involves a brilliant insight, but it is important to recognize all the clues that made that insight possible. If we compare the human body to a building, the body's complete plan and project, down to its minutest technical detail, is present in the DNA in the nucleus of each cell. All the developmental phases of a human being, in the mother's womb and after birth, take place within the outlines of a predetermined program. We are made up of many, many cells that we cannot see, and each cell has a job. Some clusters of cells make up our muscles, some make up our bones, and all together they make our bodies! But how does each cell know what to do? That is where DNA comes in: it tells the cells what to do. We are all made of trillions of cells. There are around 2.5 billion cells in one of your hands, but they are tiny, so tiny that we cannot see them. If every cell in your hand were the size of a grain of sand, your hand would be the size of a school bus! Each cell has its own job, just like humans do. Some cells help us detect light and see, other cells help us touch, some help us hear, others carry oxygen around, and others help us digest food by secreting enzymes. There are over 200 cell types in the body; that is 200 different jobs!
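Before moving on, the text-to-binary-to-DNA conversion illustrated in Fig. 1.1 can be sketched in a few lines of Python. The particular 2-bit-to-base mapping used here (00 = A, 01 = C, 10 = G, 11 = T) is an assumed convention chosen for illustration; published encoding schemes differ in detail.

```python
# Sketch of the encoding pipeline from Fig. 1.1: text -> ASCII bits -> DNA.
# The 2-bit-to-base table below is an illustrative assumption, not a standard.

BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def text_to_dna(text: str) -> str:
    """Convert text to a DNA string: each byte becomes 8 bits, then 4 bases."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(text_to_dna("Hi"))  # -> CAGACGGC
```

Each character thus occupies exactly four bases, so a 1 MB file maps to roughly four million nucleotides under this toy scheme.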
DNA, the Columbus discovery

The discovery of DNA took a lot of patience and sweat. It happened in increments stretching from the 19th century through the 1940s and 1950s, and it continues today. We cannot give credit to a single scientist; the work continued from different angles and in different stages. The voyage to unpack DNA's fundamental secrets
Chapter 1 Discovery of the book of life: DNA
began in 1865 in a monastery, where a humble monk named Gregor Mendel had a holy moment when he witnessed pea plants inheriting their parents' traits. Mendel did not know that what was being transferred was deoxyribonucleic acid: DNA. In 1998, after a series of experiments, scientists revealed that gene expression is controlled by a phenomenon called RNA interference. This process defends against viruses that try to insert themselves into DNA and control gene expression. The discovery was awarded a Nobel Prize in 2006 and has led directly to research on "silencing" genes that cause problems for the body, such as the gene causing high blood pressure. Before cells divide, they must double their DNA so that each daughter cell gets identical copies of the DNA strands. DNA replication, carried out by enzymes, helps ensure that the bases are copied correctly. DNA is, in effect, a biological USB drive: a carrier of genetic instructions for the development and life of one organism after another. One exciting discovery led to another astonishing revelation, until the human genome itself was decoded. The human genome is the entirety of our DNA, described by former president Bill Clinton as "without a doubt ... the most important, most wondrous map ever produced by humankind." Scientists dutifully continue to follow the map and step into uncharted territory, working devotedly to connect the dots. New connections between genes and diseases - yesterday Alzheimer's, today cancer, tomorrow perhaps depression - are continually brought to light, and the search for knowledge keeps marching on. It is no theoretical endeavor.
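The replication step described above rests on complementary base pairing: A pairs with T, and C pairs with G, so each strand of the double helix serves as a template for rebuilding its partner. A minimal sketch of that rule:

```python
# Semiconservative replication in miniature: each original strand templates
# a new complementary partner, so both daughter duplexes carry the same code.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(strand: str) -> str:
    """Build the complementary strand, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)

original = "ATGCGT"
partner = complement_strand(original)
print(partner)                                   # -> TACGCA
# Complementing the complement recovers the original strand, which is why
# each daughter cell ends up with identical genetic information.
print(complement_strand(partner) == original)    # -> True
```

This symmetry (the complement of the complement is the original) is what makes the copying process faithful.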
The DNA pioneers

Let us, for a moment, appreciate the excruciating "sweat work" of the dedicated DNA explorers who unraveled the mystery of DNA, the blueprint of life (see Fig. 1.2).

1865 - The Moravian monk Gregor Mendel, called the father of genetics, discovered how pea plants inherit their parents' traits. The law of inheritance was named after him.

1869 - Friedrich Miescher, the daddy of DNA, discovered that every cell carries the same DNA "goo" as hereditary information.

1910 - Thomas Hunt Morgan, with his fruit fly experiments, proved that chromosomes are sex specific (XX for females, XY for males).

1928 - Frederick Griffith discovered genetic transformation, which underpins much of the genetic engineering being done today. And then the big revelation came, when the "digital universe" merged with the "biomedicine universe": DNA emerged, with its striking features, as a prospective medium for data storage.

1931 - Barbara McClintock, Nobel Prize Laureate of 1983 and known as the foremost cytogeneticist in the world, is best known for her research and resulting theories on "jumping genes."

1944 - Oswald Avery discovered that DNA carries a cell's genetic material and can be altered through transformation, whether by chemical transformation, electroporation, or particle bombardment.

1953 - Rosalind Franklin, a British chemist, is best known for her role in the discovery of the structure of DNA.

1953 - James Watson and Francis Crick contributed to the discovery of DNA's structure, the spiral ladder also known as the double helix.
FIGURE 1.2 The chronology of the luminaries and their great vision to help humanity. The adventure is just beginning!
1966 - Marshall Nirenberg, a Physiology Nobel Laureate, was able to break the "genetic code" and describe how it operates in protein synthesis.

1973 - Stanley Cohen pioneered cut-and-paste DNA, the first step in reengineering an organism's DNA.

1975 - Experiments showed that every cell contains an organism's entire genetic manual, its genome.

1977 - Frederick Sanger helped geneticists read the DNA manual, in which DNA's four nucleotides repeat millions, even billions, of times within a genome.

1987 - Francis Collins, director of the National Institutes of Health, is a physician-geneticist who discovered the genes associated with a number of diseases and led the Human Genome Project.

1988 - DNA was introduced into crime forensics. All states in the United States now maintain DNA databases of criminals.

1996 - DNA cloning of mammals opened up intriguing new possibilities.

2003 - J. Craig Venter's team and the publicly funded Human Genome Project, long competitors, completed the sequencing of the human genome.

2010 - A new science emerged under the name "epigenetics": the study of how genes are influenced by outside forces, such as environment or lifestyle.
DNA as an organic data castle

While geneticists are still on a voyage to unpack the fundamental secrets of DNA, new opportunities have emerged and ushered in a bright new future for data informatics. A systematic process is being formulated to store digital data in DNA - a discovery of biblical proportions. We are fast approaching a new era, the Data Age. We can assign the term "data universe" to the total data created, captured, processed, stored, transferred, and replicated on our planet over the past 60 years. The science behind storing data in DNA has been proven: researchers have demonstrated that DNA is a scalable, random-access, and error-free data storage system. DNA also remains stable for thousands of years, which makes it attractive for long-term data storage, and advancements in next-generation sequencing have enabled rapid, error-free readout of data stored in DNA. As the data storage crisis worsens in the coming years, as shown in Fig. 1.3, DNA will be used to store vast amounts of data in a highly dense medium. Data created by the Internet alone amount to 90% of the total. One of the primary reasons is our gargantuan appetite for knowledge generated from data. We are going to have a serious problem of data austerity - I call it the "data stampede" - where good data will devour good data! IDC forecasts that by 2025 the data universe will have its big bang, when data grow to 163 zettabytes (ZB; 163 x 10^21 bytes). That is 10 times the 16.1 ZB of data generated in 2016! Magnetic media will soon reach their scalability limit; that is when we will praise our Holy Grail, DNA. One of the reasons DNA is considered a better storage system is density: 215 petabytes (215 million gigabytes) can be stored in just 1 g of DNA. DNA is also "apocalypse-proof," because even after a global disaster, one thing that we can always preserve and store is DNA.
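Taking the figures quoted in this section at face value (163 ZB of data forecast for 2025, and roughly 215 PB per gram of DNA), a quick back-of-the-envelope calculation shows just how compact DNA storage would be:

```python
# Back-of-envelope check using the figures quoted in this section:
# a 163 ZB data universe by 2025, and ~215 PB stored per gram of DNA.

ZB = 10**21              # zettabyte, in bytes
PB = 10**15              # petabyte, in bytes

data_universe_2025 = 163 * ZB
dna_density = 215 * PB   # bytes per gram of DNA

grams_needed = data_universe_2025 / dna_density
print(f"{grams_needed:,.0f} g (~{grams_needed / 1000:.0f} kg)")
```

On these numbers, the entire forecast 2025 data universe would fit in well under a tonne of DNA.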
In 1965, the Russian physicist Mikhail Samoilovich Neiman was the first pioneer to propose the idea of storing and retrieving information from DNA molecules; the concept became known as MNeimON (Mikhail Neiman Oligonucleotides). Fig. 1.4 shows how a DNA "flash drive" could store 44 x 10^21 bytes of data in 1 g of DNA.
FIGURE 1.3 A picture is worth a thousand words. The solid black line shows today's daily consumption of data at work and at home. The thinner black lines represent the information we actually store; we cannot store everything we want, because we are going to run out of magnetic storage. The dashed line represents DNA-coded storage, which shows that we have plenty of room to store all our information for the next 1000 years without running out of space.
FIGURE 1.4 The magic DNA flash USB, capable of storing 44 x 10^21 bytes of data in 1 g of DNA. It would take 2.6 x 10^9 hard drives, 227 x 10^6 magnetic tapes, or 3 x 10^6 Blu-ray discs to store the same amount of data.
Music is also one of DNA's talents

Here is a scenario that will surprise you: assume that you could store the whole library of the famous Arabic singer Feirouz, or of Kathem Al-Saher, in 30 g of live DNA. Then, 500 years from now - even 3000 years from now - Arabic music fans would be overwhelmed by the beauty of these old songs. One of the most intriguing projects for storing music in DNA is called "Music of the Spheres." It uses bioinformatics technology developed by Dr. Nick Goldman and Charlotte Jarvis. They took music from the Kreutzer Quartet, encoded the recording into DNA, and stored it as digital information in synthetic DNA molecules. Goldman and Jarvis then suspended the DNA in a soap solution: the soap bubbles would fill the air, pop on visitors' skin, and "bathe" the room in music. It might sound far-fetched, but the technology to encode music in synthetic DNA was developed by Goldman and his team a few years earlier, imitating the binary method computers use to store information digitally and swapping the 0s and 1s for the base chemicals that form DNA sequences. Music of the Spheres follows on from a similar project undertaken by Jarvis a few years before, when she encoded simple sentences in the DNA of bacteria.
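Reading such data back out simply reverses the bit-to-base mapping. Note that Goldman's published scheme actually used a rotating base-3 code to avoid runs of repeated bases, so the plain 2-bit mapping below (A = 00, C = 01, G = 10, T = 11) is only an illustrative simplification, not his method:

```python
# Decoding sketch: recover bytes from a DNA string under a simple
# 2-bit-per-base mapping. This is an illustrative simplification, not
# the rotating base-3 code used in Goldman's actual encoding.

BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def dna_to_bytes(dna: str) -> bytes:
    """Turn each base back into 2 bits, then group every 8 bits into a byte."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(dna_to_bytes("CAGACGGC"))  # -> b'Hi'
```

The same decoder works whether the recovered bytes represent text, an image, or an audio file; the file format is the application's concern, not the DNA's.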
Appendix

Dr. Rosalind Franklin (history was unfair to her)

Rosalind Franklin was a British chemist best known for her role in the discovery of the structure of DNA. This amazing woman also pioneered the use of X-ray diffraction, and she overcame personal and societal strife to make one of the greatest discoveries in science. July 25th would have marked her 97th birthday, and so it seems only fitting to honor her life and contribution to science. Franklin made a crucial contribution to the discovery of the double helix structure of DNA, but some would say she got a raw deal. Biographer Brenda Maddox called her the "Dark Lady of DNA," based on a once-disparaging reference to Franklin by one of her coworkers. Unfortunately, this negative appellation undermined the positive impact of her discovery. Indeed, Franklin remains in the shadows of science history: while her work on DNA was crucial to the discovery of its structure, her contribution to that landmark discovery is little known. Franklin was born on July 25, 1920, in London, to a wealthy Jewish family who valued education and public service. At age 18, she enrolled in Newnham College, a women's college at Cambridge University, where she studied physics and chemistry. After Cambridge, she went to work for the British Coal Utilization Research Association, where her work on the porosity of coal became her Ph.D. thesis; later it would allow her to travel the world as a guest speaker. In 1946, Franklin moved to Paris, where she perfected her skills in X-ray crystallography, which would become her life's work. Although she loved the freedom and lifestyle of Paris, after 4 years she returned to London to accept a job at King's College.
Franklin worked hard and played hard. She was an intrepid traveler and avid hiker with a great love of the outdoors, and she enjoyed spirited discussions of science and politics. Friends and close colleagues considered Franklin a brilliant scientist and a kindhearted woman. However, she could also be short-tempered and stubborn, and some fellow scientists found working with her a challenge. Among them was Maurice Wilkins, the man she was to work alongside at King's College. A misunderstanding caused immediate friction between Wilkins and Franklin, and their clashing personalities deepened the divide. The two were supposed to work together on finding the structure of DNA, but their conflicts led them to work in relative isolation. While this suited Franklin, Wilkins went looking for company at "the Cavendish" laboratory in Cambridge, where his friend Francis Crick was working with James Watson on building a model of the DNA molecule. Unknown to Franklin, Watson and Crick saw some of her unpublished data, including the beautiful "Photo 51," shown to Watson by Wilkins. This X-ray diffraction picture of a DNA molecule was Watson's inspiration: the pattern was clearly a helix. Using Franklin's photograph and their own data, Watson and Crick created their famous DNA model. Franklin's contribution was not acknowledged at the time, but after her death Crick said that her contribution had been critical. Franklin later moved to Birkbeck College where, ironically, she began working on the structure of the tobacco mosaic virus, building on research that Watson had done before his work on DNA. During the next few years, she did some of the best and most important work of her life, traveling the world to talk about coal and virus structure. However, just as her career was peaking, it was cut tragically short when she died of ovarian cancer at age 37.
How was DNA first discovered, and who discovered it? Read on to find out.

The story of the discovery of DNA begins in the 1800s. The molecule now known as DNA was first identified in the 1860s by a Swiss chemist named Johann Friedrich Miescher, who set out to research the key components of white blood cells, part of our body's immune system. His main source of these cells was pus-coated bandages collected from a nearby medical clinic. Miescher carried out experiments using salt solutions to learn more about what makes up white blood cells. He noticed that when he added acid to a solution of the cells, a substance separated from the solution; it dissolved again when an alkali was added. Investigating this substance, he realized that it had unexpected properties, different from those of the proteins he was familiar with. Miescher called the mysterious substance "nuclein," because he believed it had come from the cell nucleus. Unbeknownst to him, Miescher had discovered the molecular basis of all life: DNA. He then set about finding ways to extract it in its pure form. Miescher was convinced of the importance of nuclein and came very close to uncovering its elusive role, despite the simple tools and methods available to him. However, he lacked the skills to communicate and promote what he had found to the wider scientific community. Ever the perfectionist, he hesitated for long periods between experiments before publishing his results in 1874; before then, he primarily discussed his findings in private letters to friends. As a result, it was many decades before Miescher's discovery was fully appreciated by the scientific community.
Glossary (courtesy of the MERIT CyberSecurity archive)

Allele: One of several possible versions of a gene. Each one contains a distinct variation in its DNA sequence. For example, a "deleterious allele" is a form of a gene that leads to disease.

Amino acid: The chemical building block of proteins. During translation, different amino acids are strung together to form a chain that folds into a protein.

Archaea: Microbes that look similar to bacteria but are actually more closely related to eukaryotes, such as humans. Archaea are single-celled organisms that do not have a nucleus and can only be seen with a microscope. They are found in many different habitats, and many of the first known examples were found in extreme environments.

Bacteria: An abundant type of microbe. These single-celled organisms are invisible to the naked eye, do not have a nucleus, and can have many shapes. They are found in all types of environments, from Arctic soil to inside the human body. Most bacteria are not harmful to human health, but certain pathogenic bacteria can cause illness.

Base: The four "letters" of the genetic code (A, C, T, and G) are chemical groups called bases or nucleobases. A = adenine, C = cytosine, T = thymine, and G = guanine. Instead of thymine, RNA contains a base called uracil (U).

Base pair: Different chemicals known as bases or nucleobases are found on each strand of DNA. Each base has a chemical attraction for a particular partner base, known as its complement. C matches up with G, whereas A pairs with T or U. These bonded genetic letters are called base pairs. Two strands of DNA can zip together to form a double helix shape when complementary bases match up to form base pairs.

Cancer: A type of disease caused by uncontrolled growth of cells. Cancerous cells may form clumps or masses known as tumors and can spread to other parts of the body through a process known as metastasis.
Cas: Abbreviation of CRISPR-associated; may refer to the genes (cas) or proteins (Cas) that protect bacteria and archaea from viral infection.

Cas9: A protein derived from the CRISPR-Cas bacterial immune system that has been coopted for genome engineering. Uses an RNA molecule as a guide to find a complementary DNA sequence. Once the target DNA is identified, Cas9 cuts both strands. It has been compared with "molecular scissors" or a "genetic scalpel." In CRISPR immunity, cutting viral DNA prevents it from destroying the host cell. In genome engineering, cutting genomic DNA initiates a repair process that ends up making a change, or "edit," to its sequence.

Cell: The basic unit of life. The number of cells in a living organism ranges from one (e.g., yeast) to quadrillions (e.g., blue whale). A cell is composed of four key macromolecules that allow it to function (proteins, lipids, carbohydrates, and nucleic acids). Among other things, cells can build and break down molecules, move, grow, divide, and die.

Chromosome: The compact structure into which a cell's DNA is organized and held together by proteins. The genomes of different organisms are arranged into varying numbers of chromosomes; human cells have 23 pairs.

Cleave: The scientific term for cut or break apart. Typically refers to splitting apart a long polymeric molecule such as DNA, RNA, or protein. For example, a nuclease like Cas9 can be directed to cleave DNA at a specific location.
Complementary: Describes any two DNA or RNA sequences that can form a series of base pairs with each other. Each base forms a bond with a complementary partner. T (DNA) and U (RNA) bond with A, and C complements G. For example, in CRISPR immunity, the spacer sequence in a guide RNA is complementary to a sequence found in a viral genome. When the RNA bases pair with complementary DNA bases from an invading virus, the Cas9 protein will cut the target to stop the viral infection.

Cpf1 (Cas12): A protein derived from the CRISPR-Cas bacterial immune system that has been coopted for genome engineering. Uses an RNA molecule as a guide to find a complementary DNA sequence. Once the target DNA is identified, Cpf1 cuts both strands. It has been compared with "molecular scissors" or a "genetic scalpel." In CRISPR immunity, cutting viral DNA prevents it from destroying the host cell. In genome engineering, cutting genomic DNA initiates a repair process that ends up making a change, or "edit," to its sequence.

CRISPR: Pronounced "crisper." An adaptive immune system found in bacteria and archaea, coopted as a genome engineering tool. Acronym of "clustered regularly interspaced short palindromic repeats," which refers to a section of the host genome containing alternating repetitive sequences and unique snippets of foreign DNA. CRISPR-associated surveillance proteins use these unique sequences as molecular mugshots as they seek out and destroy viral DNA to protect the cell.

CRISPR RNA (crRNA): During CRISPR immunity, the host cell generates crRNA molecules, each containing one spacer that is complementary to a portion of a viral genome. crRNAs guide CRISPR immune proteins to find and destroy matching invader sequences.

CRISPR screening: A technique that lets scientists see the effects of turning gene expression up or down with CRISPRa and CRISPRi. Instead of checking one gene at a time, a single CRISPR screen can provide information about thousands of different genes at a time.
CRISPRa and CRISPRi: CRISPRa stands for CRISPR activation, and CRISPRi stands for CRISPR interference or inhibition. Both are methods for fine-tuning gene expression. If a gene were a car, CRISPRa is the gas pedal, and CRISPRi is the brake. Using CRISPRa to activate a gene increases protein production. Using CRISPRi to "turn down" a gene reduces the number of protein products made from that gene.

dCas9: Catalytically inactive, or "dead," Cas9. This mutated version of the Cas9 protein cannot cut but still binds tightly to a particular DNA sequence specified by the guide RNA. Can be used to physically block the process of transcription, turning off a specific gene, or to shuttle other proteins to a particular site in the genome.

DNA: Abbreviation of deoxyribonucleic acid, a long molecule that encodes the information needed for a cell to function or a virus to replicate. Forms a double helix shape that resembles a twisted ladder. Different chemicals called bases, abbreviated as A, C, T, and G, are found on each side of the ladder, or strand. The bases have an attraction for each other, making A stick to T and C stick to G. These rungs of the ladder are called base pairs. The sequence of these letters is called the genetic code.

Double-strand break (DSB): When both strands of DNA are broken, two free ends are created. May be made intentionally by a tool such as Cas9. Cells repair their DNA to prevent cell death, sometimes changing the DNA sequence at the site of the break. Initiating or controlling this process with the intent to alter a DNA sequence is known as genome engineering.

Enzyme: A molecule, typically a protein, that causes or catalyzes a chemical change. Usually an enzyme's name describes a molecule involved in the activity it performs and ends with the suffix -ase.
For example, lactase is a well-known enzyme that breaks down lactose, a sugar found in milk. Cas9 is a nuclease, an enzyme that breaks apart the backbone of nucleic acids (RNA or DNA).

Epigenetic: Refers to changes to a cell's gene expression that do not involve altering its DNA code. Instead, the DNA and the proteins that hold onto DNA are "tagged" with removable chemical signals. Epigenetic marks tell other proteins how to read the DNA, which parts to ignore, and which parts to transcribe into RNA.

Eukaryote: A domain of organisms whose cells contain a nucleus and other organelles. Eukaryotes are often large and multicellular (e.g., elephants) but can also exist as microscopic, single cells (e.g., yeast). This category of life includes humans. Compare with prokaryotes (bacteria and archaea).

Expression: A product being made from a gene; can refer to either RNA or protein. When a gene is turned on, cellular machines "express" this by transcribing the DNA into RNA and/or translating the RNA into a chain of amino acids. For example, a "highly expressed" gene will have many RNA copies produced, and its protein product is likely to be abundant in the cell. CRISPRi and CRISPRa are methods for turning gene expression down or up, respectively.

Gene: A segment of DNA that encodes the information used to make a protein. Each gene is a set of instructions for making a particular molecular machine that helps a cell, organism, or virus function.

Gene drive: A mechanism for preferential inheritance of a particular DNA sequence. Usually, offspring have a semirandom chance of inheriting a given stretch of DNA from either parent. In a scientist-designed gene drive, a gene is engineered to have a 100% chance of being passed on. Gene drives can force the inheritance of a desirable trait through a population of organisms. For example, this approach could potentially make all mosquitoes incapable of transmitting the malaria parasite.
Gene therapy: Delivering corrective DNA to human cells as a medical treatment. Certain diseases can be treated or even cured by adding a healthy DNA sequence into the genomes of particular cells. Scientists and doctors typically use a harmless virus to shuttle genes into targeted cells or tissues, where the DNA is incorporated somewhere within the cells' existing DNA. CRISPR genome editing is sometimes referred to as a gene therapy technique.

Genetically modified organism (GMO): A GMO has had its DNA intentionally altered using scientific tools. Any organism can be engineered in this manner, including microbes, plants, and animals.

Genome: The entire DNA sequence of an organism or virus. The genome is essentially a huge set of instructions for making individual parts of a cell and directing how everything should run.

Genome editing: Intentionally altering the genetic code of a living organism. Can be done with ZFNs, TALENs, or CRISPR. These systems are used to create a double-strand break at a specific DNA site; when the cell repairs the break, the sequence is changed. It can be used to remove, change, or add DNA.

Genome surgery: Repairing harmful DNA through a one-time genome editing procedure. Unlike taking a drug that will temporarily reduce long-term symptoms, altering a patient's genetic code with the CRISPR-Cas9 "molecular scalpel" would permanently and directly reverse the cause of a genetic disease.

Genomics: The study of the genome, all the DNA from a given organism. It involves a genome's DNA sequence, organization, and control of genes, the molecules that interact with DNA, and how these different components affect the growth and function of cells.
Germ cells: The cells involved in sexual reproduction: eggs, sperm, and precursor cells that develop into eggs or sperm. The DNA in germ cells, including any mutations or intentional genetic edits, may be passed down to the next generation. In contrast, the genetic material in somatic cells (all the cells in the body except for germ cells) cannot be inherited by offspring. Note that genome editing in an early embryo is considered to be germline editing, since any DNA changes will likely end up in all cells of the organism that is eventually born.

Guide RNA (gRNA): A two-piece molecule that Cas9 binds and uses to identify a complementary DNA sequence. Composed of the CRISPR RNA (crRNA) and trans-activating CRISPR RNA (tracrRNA). Cas9 uses the tracrRNA portion of the guide as a handle, whereas the crRNA spacer sequence directs the complex to a matching DNA sequence. Scientists have also engineered a version of the guide RNA that consists of a single molecule, the single-guide RNA (sgRNA).

Homology-directed repair (HDR): A way for a cell to repair a break in its DNA by "patching" it with a piece of donor DNA. The donor DNA must contain sequences similar, or homologous, to the broken DNA ends for it to be incorporated. HDR is a more precise repair pathway than nonhomologous end joining. In genome engineering, a researcher designs and adds in the donor DNA, potentially allowing scientists to replace a disease-causing gene with a healthy copy.

Indel: Abbreviation for insertion or deletion. Refers to the random removal or addition of nucleotides in a DNA sequence. This can be enough to stop a gene from functioning (imagine removing a page from the middle of an instruction manual). Indels occur when DNA is broken and "sloppily" repaired by the cell in a process called nonhomologous end joining (NHEJ).

Microbe: A microscopic organism. Can be single-celled or multicellular, and the term is sometimes used to refer to viruses, although they are not considered to be alive.
Examples include bacteria, yeast, and algae.

Mutation: A change from one genetic letter (nucleotide) to another. Variation in DNA sequence gives rise to the incredible diversity of species in the world and even occurs between different organisms of the same species. While some mutations have no consequence at all, certain mutations can directly cause disease. Mutations may be caused by DNA-damaging agents such as UV light or may arise from errors that occur when DNA is copied by cellular enzymes. They can also be made deliberately via genome engineering methods.

Nonhomologous end joining (NHEJ): A way for a cell to repair a break in its DNA by attaching the free DNA ends. This pathway is "sloppier" than homology-directed repair and often results in the random addition or removal of nucleotides around the site of the DNA break, causing insertions or deletions in the genetic code. In genome engineering, this allows scientists to stop a gene from working (similar to removing a page from the middle of an instruction manual).

Nick: When only one strand of DNA is broken, there is a gap, called a nick, in the backbone, but the DNA does not separate. A tool like CRISPR-Cas9 may be used to generate a nick.

Nuclease: An enzyme that breaks apart the backbone of RNA or DNA. Breaking one strand generates a nick; breaking both strands generates a double-strand break. An endonuclease cuts in the middle of RNA or DNA, whereas an exonuclease cuts from the end of the strand. Genome engineering tools such as Cas9 are endonucleases.

Nucleic acid: A term for DNA and RNA. Refers to nucleotides, the basic chemical units that are strung together to make DNA or RNA. One of the four macromolecules that make up all living things (proteins, lipids, carbohydrates, and nucleic acids).
Nucleotide: One of the basic chemical units strung together to make DNA or RNA. Consists of a base, a sugar, and a phosphate group. The phosphates can link with sugars to form a string called the DNA/RNA backbone, whereas the bases can bind to their complementary partners to form base pairs.

Off-target effect: When a genome engineering enzyme cuts DNA at an unintended, "off-target," site that is similar to the intended target.

Protospacer adjacent motif (PAM): A short sequence that must be present next to a DNA target sequence for Cas9 to bind and cut. Prevents cleavage of the host CRISPR array, where the PAM is not present.

Pathogen: A microbe that causes illness. Most microorganisms are not pathogenic to humans, but some strains or species are harmful.

Phage: A type of virus that infects bacteria or archaea; formally called a bacteriophage.

Prokaryote: A category of living organisms that encompasses all bacteria and archaea. Prokaryotes are microscopic, single-celled organisms that do not have a nucleus or other membrane-bound organelles. Compare with eukaryotes.

Protein: A string of amino acids folded into a three-dimensional structure. Proteins are each specialized to perform a specific role to help cells grow, divide, and function. One of the four macromolecules that make up all living things (proteins, lipids, carbohydrates, and nucleic acids).

Ribonucleoprotein complex (RNP): An assembly of molecules containing both protein and RNA. Often used to describe Cas9 protein bound to guide RNA (gRNA), which together form an active enzyme. For genome editing in cells, Cas9 can be delivered as a preassembled RNP, or as DNA or RNA encoding the genetic instructions for the protein and RNA components.

RNA: Abbreviation of ribonucleic acid. Transcribed from a DNA template and typically used to direct the synthesis of proteins. CRISPR-associated proteins use RNAs as guides to find matching target sequences in DNA.
Single-guide RNA (sgRNA): A version of the naturally occurring two-piece guide RNA complex engineered into a single, continuous sequence. The simplified single-guide RNA is used to direct the Cas9 protein to bind and cleave a particular DNA sequence for genome editing.
Somatic cells: All the cells in a multicellular organism except for germ cells (eggs or sperm). Mutations or changes to the DNA in the soma will not be inherited by subsequent generations.
Stem cells: Cells with the potential to turn into a specialized type of cell or to divide to make more stem cells. Most cells in your body are differentiated; that is, their fate has already been decided and they cannot morph into a different kind of cell. For example, a cell in your brain cannot transform into a skin cell. Embryonic stem cells are found in developing embryos, whereas adult stem cells are found in tissues including bone marrow, blood, and fat. Adult stem cells replenish the body as it becomes damaged over time.
Strand: A string of connected nucleotides; can be DNA or RNA. Two strands of DNA can zip together when complementary, and bases match up to form base pairs. DNA typically exists in this double-stranded form, which takes the shape of a twisted ladder or double helix. RNA is typically composed of just a single strand, although it can fold up into complex shapes.
Transcription activator-like effector nuclease (TALEN): A genetic engineering tool wherein one portion of the protein recognizes a specific DNA sequence and another part cuts DNA. Made by attaching a series of smaller DNA-binding domains together to recognize a longer DNA sequence. This DNA-binding domain is fused to a nuclease that will cut nearby DNA. Like CRISPR-Cas9 and ZFNs, it can be used to alter DNA sequences.
Trans-activating CRISPR RNA (tracrRNA): Pronounced "tracer RNA." In the CRISPR-Cas9 system, the tracrRNA base pairs with the crRNA to form a functional guide RNA (gRNA). Cas9 uses the tracrRNA portion of the guide as a handle, whereas the crRNA spacer sequence directs the complex to a matching viral sequence.
Transcription: The process by which DNA information is copied into a strand of RNA; performed by an enzyme called RNA polymerase.
Translation: The process by which proteins are made based on instructions encoded in an RNA molecule. Performed by a molecular machine called the ribosome, which links together a series of amino acid building blocks. The resulting polypeptide chain folds up into a particular 3D shape, known as a protein.
Virus: An infectious entity that can only persist by hijacking a host organism to replicate itself. It has its own genome but is technically not considered a living organism. Viruses infect all organisms, from humans to plants to microbes. Multicellular organisms have sophisticated immune systems that combat viruses, whereas CRISPR systems evolved to stop viral infection in bacteria and archaea.
Zinc-finger nuclease (ZFN): A genetic engineering tool wherein one portion of the protein recognizes a specific DNA sequence and another part cuts DNA. Made by attaching a series of smaller DNA-binding domains together to recognize a longer DNA sequence. This DNA-binding domain is fused to a nuclease that will cut nearby DNA. Like CRISPR-Cas9 and TALENs, it can be used to alter DNA sequences.
Suggested readings
Articles on DNA. https://theconversation.com/us/topics/dna-251.
Case study DNA. http://www.ks.uiuc.edu/Training/CaseStudies/pdfs/dna.pdf.
What does DNA sound like? http://theconversation.com/what-does-dna-sound-like-using-music-to-unlock-the-secrets-of-genetic-code-78767.
DNA for kids. https://owlcation.com/academia/explaining-dna-to-a-six-year-old.
Eldering, C., Sylla, M.L., Eisenach, J.A., October 1999. Is there a Moore's law for bandwidth? IEEE Communications, 117-121.
Is music in our DNA? https://futurism.com/is-music-in-our-dna/.
Ganek, Alan G., March 2003. The dawning of the autonomic computing era. IBM Systems Journal. http://www.findarticles.com/p/articles/mi_m0ISJ/is_1_42/ai_98695283/print.
Kurzweil, Ray, 2005. The Singularity Is Near. Penguin Books.
National Geographic, King Tut and the Golden Age of the Pharaohs, May 25, 2018.
Storing music in DNA. https://phys.org/news/2017-09-items-music-anthology-eternity-dna.html.
Timeline of DNA. https://www.dna-worldwide.com/resource/160/history-dna-timeline.
DNA structure. file:///C:/DNA-Storage/DNA-structure.pdf.
Structure and history of DNA. https://www.mun.ca/biology/scarr/Biol4241_2016_Watson&Crick_DNA.pdf.
What is a genome? https://www.yourgenome.org/facts/what-is-a-genome.
CHAPTER 2
The amazing human DNA system explained
What an exquisitely talented engineer Mother Nature is, and how man is in comparison. – Alice Park
And now the announcement of Watson and Crick about DNA. This is, for me, the real proof of the existence of God. – Salvador Dali
DNA neither cares nor knows. DNA just is. And we dance to its music. – Richard Dawkins
The capacity to blunder slightly is the real marvel of DNA. Without this special attribute, we would still be anaerobic bacteria and there would be no music. – Lewis Thomas
“Magical DNA” found in King Tut study
Tutankhamun was a pharaoh during ancient Egypt's New Kingdom era, about 3300 years ago. He ascended to the throne at the age of 9 years but ruled for only 10 years before dying at 19, around 1324 BC. A DNA study found that King Tut was a frail pharaoh, beset by malaria and a bone disorder, his health possibly compromised by his newly discovered incestuous origins. The scientists found DNA from the mosquito-borne parasite that causes malaria in the young pharaoh's body, the oldest known genetic proof of the disease. The good condition of the DNA from King Tut's body (see Fig. 2.1) surprised many members of the study team. The team suspects that the embalming method the ancient Egyptians used to preserve the royal mummy inadvertently protected DNA as well as flesh. Preserving DNA was not the aim of the mummification, of course, but the embalming method used was very helpful to today's geneticists.
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00002-1 Copyright © 2020 Elsevier Inc. All rights reserved.
FIGURE 2.1 The DNA examination of King Tut’s body revealed previously unknown deformation in the king’s left foot, caused by the necrosis, or death, of bone tissue. The study found more than one strain of the malaria parasite, indicating that King Tut caught multiple malarial infections during his life.
Fighting chaos
The human body has five vital organs (in no specific order) that are essential for survival: the brain, heart, kidneys, liver, and lungs. So where does DNA fit in this list of vital organs? DNA is as critical as the air we breathe. Because DNA controls the structure and functions of all the cells in the body, it carries the instructions essential to the production of the proteins that determine our genetic profile. This is the central dogma of molecular biology. If our proteins do not do their job correctly, as instructed, and instead grow wild, then tumors may develop and even become malignant.
Anatomy of DNA
DNA is the blueprint of life, the biological fabric that carries the genetic instructions at the correct time and in the correct sequence (see Fig. 2.2). DNA is used in the growth, development, functioning, and reproduction of all known living organisms, as well as many viruses. The building blocks used to write the genetic instructions are the letters A, C, G, and T, which provide the precise blueprints for how our bodies are built and how they work. In the digital world, the binary code (0 and 1) is used to build computer instructions in much the same way.
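The parallel between the two alphabets can be sketched in a few lines of Python. The 2-bits-per-base mapping below is purely illustrative, not the encoding used by any particular DNA-storage system:

```python
# Hypothetical mapping for illustration: each pair of bits becomes one base.
BIT_PAIRS = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bits_to_dna(bits):
    """Convert a string of 0s and 1s (even length) into DNA letters."""
    return "".join(BIT_PAIRS[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(bits_to_dna("00011011"))  # ACGT
```

Because DNA has four letters where binary has two, one base carries two bits, which is part of why DNA is such a dense storage medium.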
FIGURE 2.2 DNA is a big molecule with a spiral shape called a double helix. DNA contains the original codes for making the proteins that living cells need. mRNA (the m stands for messenger) carries the original DNA code to the ribosome, the cell's protein engine, which reads the coding sequence and strings together the appropriate amino acids to build the protein.
Think of DNA as a long string of pearls made with building blocks called nucleotides. Each pair of nucleotides is called a base pair. DNA is an incredible autonomic, smart, self-regulating, self-repairing mechanism with the most critical responsibility in the human body. It manufactures the human body, cell by cell and part by part, and extends life from one person to another.
The central dogma of genetics
The central dogma describes the process of building a functional protein product from DNA instructions. It was first proposed in 1958 by Francis Crick, co-discoverer of the structure of DNA. The central dogma explains the flow of genetic information from DNA through RNA, an intermediary "black box," to protein. It suggests that DNA contains the information (the genetic code) needed to make all of our proteins and that RNA is the messenger that carries this information to the ribosomes, the factories in the cell where the information is "translated" from a code into the functional product. As Fig. 2.3 demonstrates, DNA is built from nucleotides, and each pair of nucleotides forms a base pair.
FIGURE 2.3 DNA genetic code comes in pairs. The four types of nitrogen bases are adenine (A), thymine (T), guanine (G), and cytosine (C). The order of these bases is what determines DNA’s instructions or genetic code.
The genetic code
DNA is made of "contiguous" chemical building blocks, arranged in pairs, called nucleotides. To form a strand of DNA, nucleotides are linked into chains that contain the instructions needed to build a protein. A gene is the sequence of nucleotides that contains the genetic information required to make a specific protein. Fig. 2.4 shows the structure of DNA visually.
The process by which the DNA genetic code is converted into a functional product is called gene expression, which has two key stages: transcription and translation. In transcription (the first stage), the genetic code in the DNA of every cell is converted into small, portable RNA messages. During translation (the second stage), these RNA messages travel to ribosomes (protein factories), where they are read to make specific proteins. The central dogma states that this information travels in one direction only, from DNA to protein.
The spiral staircase-shaped double helix has attained global status as the symbol of DNA. But what is so beautiful about the discovery of the twisting ladder structure is not just its good looks. Rather, the structure of DNA taught researchers a fundamental lesson about genetics: the two connected strands, winding together like parallel handrails, are complementary to each other, and this insight unlocked the secret of how genetic information is stored, transferred, and copied.
Long strings of nucleotides (DNA building blocks) form genes, and groups of genes are packaged tightly into structures called chromosomes. Every cell in your body, except for eggs, sperm, and red blood cells, contains a full set of chromosomes in its nucleus. Fig. 2.5 shows how all the pieces are connected together.
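The two stages of gene expression can be made concrete with a short sketch. The four-entry codon table below is a tiny invented excerpt, not the full 64-entry genetic code:

```python
# Transcription: RNA polymerase builds mRNA complementary to the DNA template.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand):
    return "".join(COMPLEMENT[base] for base in template_strand)

# Translation: the ribosome reads the mRNA three bases (one codon) at a time.
MINI_CODONS = {"AUG": "Met (start)", "UUU": "Phe", "GGC": "Gly", "UAA": "stop"}

def translate(mrna):
    return [MINI_CODONS.get(mrna[i:i + 3], "?") for i in range(0, len(mrna), 3)]

mrna = transcribe("TACAAACCGATT")
print(mrna)              # AUGUUUGGCUAA
print(translate(mrna))   # ['Met (start)', 'Phe', 'Gly', 'stop']
```

Note how the message begins with a start codon and ends with a stop codon, exactly as described in the text.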
FIGURE 2.4 DNA genetic code comes in pairs. The four types of nitrogen bases are adenine (A), thymine (T), guanine (G), and cytosine (C). The order of these bases is what determines DNA’s instructions or genetic code.
FIGURE 2.5 The long, stringy DNA that makes up genes is spooled within chromosomes inside the nucleus of a cell. (Note that a gene would be a much longer stretch of DNA than what is shown here.)
If the chromosomes in one of your cells were uncoiled and placed end to end, the DNA would be about six feet long. If all the DNA in your body were connected in this way, it would stretch approximately 67 billion miles! That is nearly 150,000 round trips to the moon.
Chromosomes are numbered 1 to 22 according to size, with 1 being the largest chromosome. The 23rd pair, known as the sex chromosomes, are called X and Y. In humans, abnormalities of chromosome number usually occur during meiosis, the time when a cell reduces its chromosomes from diploid to haploid in creating eggs or sperm. During the transcription stage, moving from DNA to RNA, the messenger RNA (mRNA) takes the paired DNA sequence and converts it into triplet genetic code, beginning with a "start" codon and ending with a "stop" codon.
Let us not forget how intricate the design of the cell and all its contents is. Each component has its magic and mystery. For example, to give you a sense of just how important DNA packing is, consider that the DNA in a typical human cell would be about 2 m long if it were extended in a straight line. All 2 m of that DNA are squeezed into a tiny nucleus with a diameter of just 0.006 mm. That is a feat geometrically equivalent to packing 40 km (24 miles) of extremely fine thread into a tennis ball.
To make a copy of itself, the twisted, compacted double helix of DNA has to unwind and separate its two strands. Each strand becomes a pattern, or template, for making a new strand, so the two new DNA molecules each have one new strand and one old strand. The copying is courtesy of a cellular protein machine called DNA polymerase, which reads the template DNA strand and stitches together the complementary new strand. The process, called replication, is astonishingly fast and accurate, although occasional mistakes, such as deletions or duplications, occur. Fortunately, a cellular spell-checker catches and corrects nearly all of these errors.
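The moon-trip figure is easy to sanity-check. The Earth-to-moon distance below is an assumed average value used only for this back-of-the-envelope calculation:

```python
MOON_MILES = 238_855                 # assumed average Earth-moon distance, in miles
TOTAL_DNA_MILES = 67_000_000_000     # total body DNA length quoted in the text

# One round trip covers the distance twice.
trips = TOTAL_DNA_MILES / (2 * MOON_MILES)
print(round(trips))                  # on the order of 140,000 round trips
```

The result lands in the neighborhood of 140,000, consistent with the "nearly 150,000 round trips" claim.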
Properties of the genetic code
The creation of a protein is a two-step process. Exons are the coding sequences in genes; introns are pieces of DNA/RNA that interrupt exons. The first step is RNA synthesis (transcription), in which the DNA is transcribed into RNA with both exons and introns still present. The second step is RNA splicing (cut and paste), in which the introns are removed; the resulting exons join to code for a protein. Once the RNA has been transcribed and spliced, it travels from the DNA template to the ribosome, the assembly machine where protein synthesis is achieved. Each set of three bases in the RNA sequence codes for one amino acid. Alternative splicing allows different combinations of exons to be joined into messenger RNA (mRNA). These mRNAs are translated into proteins in a step called protein synthesis. Fig. 2.6 illustrates the two phases of making a protein (transcription and translation).
Here are some of the strategic terms that should be remembered:
RNA: Single-stranded nucleic acid involved in protein synthesis. Like DNA, it is a nucleic acid, but it differs slightly in structure.
mRNA: Messenger RNA; the molecule that carries the instructions from the DNA to the rest of the cell, from the nucleus to the cytoplasm.
rRNA: Ribosomal RNA; one of the molecules that makes up ribosomes. It is involved in ordering the amino acids to make proteins.
FIGURE 2.6 From DNA gene to messenger RNA (mRNA). All the instructions from DNA will be carried by mRNA to the rest of the cell. mRNA carries the instructions from the nucleus to the cytoplasm.
tRNA: Transfer RNA; the molecule that carries a specific amino acid to the ribosome to make a protein.
DNA copying is not the only time DNA damage can happen. Prolonged, unprotected sun exposure can cause DNA changes that lead to skin cancer, and toxins in cigarette smoke can cause lung cancer. It may seem ironic, then, that many drugs used to treat cancer work by attacking DNA. That is because these chemotherapy drugs disrupt the DNA copying process, which goes on much faster in rapidly dividing cancer cells than in other cells of the body. The trouble is that most of these drugs also affect normal cells that grow and divide frequently, such as cells of the immune system and hair follicles.
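The splicing step described above, introns cut out and the remaining exons joined, can be sketched in code. The sequence and intron positions here are invented for illustration:

```python
# A toy pre-mRNA: two hypothetical introns interrupt three exons.
pre_mrna = "AUGGCC" + "GUAAGU" + "UUUGGC" + "GUCCGU" + "UAA"
introns = [(6, 12), (18, 24)]   # (start, end) index ranges of the introns

def splice(rna, intron_ranges):
    """Remove each intron range and join the surviving exons."""
    pieces, last = [], 0
    for start, end in sorted(intron_ranges):
        pieces.append(rna[last:start])   # exon before this intron
        last = end
    pieces.append(rna[last:])            # final exon
    return "".join(pieces)

print(splice(pre_mrna, introns))  # AUGGCCUUUGGCUAA
```

The spliced message is what the ribosome actually reads, three bases at a time.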
Another blessing of DNA, biological profiling
DNA fingerprinting, also known as biological profiling, is a chemical test that shows the genetic makeup of a person or other living thing. It is used as evidence in court, to identify bodies, to track down blood relatives, and to look for cures for disease. Fig. 2.7 shows the translation table in which each triplet code (called a codon) represents an amino acid used in making a protein. DNA testing has several common applications:
• Twins: Identical twins share the same genetic material, whereas fraternal (nonidentical) twins develop from two eggs fertilized by two sperm and are no more alike than individual siblings born at different times. It can be difficult to tell at birth whether twins are identical or fraternal.
• Paternity: To find out whether the alleged father is the biological father of the child.
• Siblings: For example, adopted people may want to have DNA tests to make sure that alleged biological siblings are their blood brothers or sisters.
• Immigration: Some visa applications may depend on proof of relatedness.
• Criminal justice: DNA testing can help solve crimes by comparing the DNA profiles of suspects to offender samples. Victorian law allows the collection of blood and saliva samples from convicted criminals and suspects; DNA profiles are then kept in a database.
FIGURE 2.7 A triplet code can produce 64 different combinations (4 × 4 × 4) of genetic code, providing plenty of information in the DNA molecule to specify the placement of all 20 amino acids. When experiments were performed to crack the genetic code, it was indeed found to be a triplet code.
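The 4 × 4 × 4 arithmetic behind the triplet code can be confirmed by enumerating every possible codon:

```python
from itertools import product

BASES = "ACGU"  # the four RNA bases

# Every ordered triple of the four bases is one codon.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))       # 64
print(codons[:4])        # ['AAA', 'AAC', 'AAG', 'AAU']
```

A doublet code would offer only 4 × 4 = 16 combinations, too few for 20 amino acids, which is why a triplet code was the natural hypothesis.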
The holy grail of DNA: gene editing
All religions believe in miracles, but they come at random times, and often only after a genuine prayer. Playing God with human genes is the mother of all miracles. Editing (cutting and pasting) DNA is a true game changer in the gene revolution. The people doing research in genetics today count themselves among the lucky ones. They are celebrating a new technique that is, by all counts, revolutionary. The cause of their uncharacteristic giddiness is a remarkably reliable method of editing the human genome. In just a few years' time, medicine may conquer cancer and HIV. Scientists are learning how to edit DNA and eliminate the mutations that cripple the lives of humans.
The yogurt story
Here is the amazing story of the discovery of CRISPR (clustered regularly interspaced short palindromic repeats): it turns out that the master key for unlocking DNA editing was waiting to be discovered inside a cup of yogurt. Scientists working in the dairy industry noticed something intriguing in a bacterium used to transform milk lactose into lactic acid, which is needed to produce yogurt. The yogurt bacterium was constantly getting infected by viruses that altered the taste of the product. When scientists sequenced the genome of the bacterium, they kept finding "strange" repeated fragments of DNA. They realized that the bacterium keeps a genetic record of the attacking viruses, as a crude but very effective immune system. Wow! In between the repeated sections of DNA were extracts of the viruses' genes; when the same virus attempted to reinfect the bacterium, it would be drawn toward its matching section on the bacterial genome and bind to it. That binding would summon a powerful enzyme that effectively cuts the virus out, leaving the bacterium free from infection. Over time, the bacteria developed a way to protect themselves: they incorporated snippets of other benign viruses into their genome, which then caused the bacteria to produce toxins that successfully defeated the more harmful viruses or trespassers. The bacteria had built their own immune system.
How gene editing works
The discovery of CRISPR is a true seismic shift, as researchers around the globe have embraced a revolutionary technology called gene editing. In one respect, it is like playing God. In the human immune system, B cells keep part of the antigen (coming from the attacking pathogen) and develop the antibody to remove the attacking agent. Geneticists learned how bacteria collect segments from an attacking virus, reverse engineer them, and develop enzymes to fight future attacks. CRISPR technology is a simple yet powerful tool for editing genomes. It allows geneticists to easily alter DNA sequences and modify gene function. Its many potential applications include correcting genetic defects, treating and preventing the spread of diseases, and improving crops. However, its promise also raises some ethical concerns.
Anatomy of CRISPR
CRISPR can simply be described as a sequence-specific adaptive immunity; adaptive systems have the ability to learn to recognize specific features of pathogens. CRISPR is a cluster of DNA sequences that helps protect healthy cells by identifying and neutralizing viral attacks. CRISPR serves several functions. It is, first, an early warning system that alerts healthy cells to an incoming viral attack. It is also an incredible piece of machinery that can slice out and capture hostile genes, storing them next to healthy genes. Here is an analogy to help explain the role of CRISPRs in genomes and how they work to protect healthy cells: imagine sending telegraphs in Morse code. Every sequence is a message about a different attack, and every space is the STOP that ends that message. Fig. 2.8 illustrates the idea that God delegated gene editing to man through a tool called CRISPR.
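The Morse-code analogy can be sketched as a toy lookup: remembered viral fragments (spacers) are matched against incoming DNA. The sequences below are invented, and real CRISPR recognition (guide RNA pairing, PAM checking) is far more involved than a substring search:

```python
# Hypothetical spacers: fragments the cell "remembers" from past infections.
spacers = ["GATTACA", "CCGGTTA"]

def recognize(incoming_dna):
    """Return the match position if a remembered spacer is found, else -1."""
    for spacer in spacers:
        pos = incoming_dna.find(spacer)
        if pos != -1:
            return pos          # a Cas enzyme would cut near this site
    return -1                   # unrecognized: the cell must learn first

print(recognize("TTTGATTACATTT"))  # 3  -> known virus, attack triggered
print(recognize("AAAACCCC"))       # -1 -> new virus, no memory yet
```

The key idea the sketch captures is that recognition is purely sequence-based: no match in the stored record, no immune response.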
FIGURE 2.8 CRISPR is the Rosetta Stone of gene editing, designed to cut and repair God's design, rewarding the patient; the sequence is shown. As in an electrician's repair, wires are cut and spliced, and the new part is welded and tested.
CRISPR places an entirely new kind of power into human hands. For the first time, scientists can quickly and precisely alter, delete, and rearrange the DNA of nearly any living organism, including humans. In the past 3 years, the technology has transformed biology. Working with animal models, researchers in laboratories around the world have already used CRISPR to correct major genetic flaws, including the mutations responsible for muscular dystrophy, cystic fibrosis, and one form of hepatitis. Recently, several teams have deployed CRISPR in an attempt to eliminate HIV from the DNA of human cells. The results have been only partially successful, but many scientists remain convinced that the technology may contribute to a cure for AIDS.
MERIT CyberSecurity Engineering Archives.
When a healthy cell encounters a new and dangerous virus, it does not know how to protect itself or beat back the virus; it has to learn, just as most immune responses do. The CRISPR sequences steal key strands of DNA from the virus and keep them in those little Morse code messages. When a similar virus attacks again, CRISPR responds and sends its army to the battlefield, much like the human immune system. CRISPR soldiers, called CRISPR-associated (Cas) enzymes, produced specifically for this mission and numbered according to their purpose, bind to viral DNA and slice it at its weak spot, according to the information encoded in the message. This shuts down the virus and enables the cell to successfully defend itself.
The wheelchair is a symbol of disability. Inga got her first wheelchair almost 2 years ago, but she has been using it only at home and school. During outdoor walks or shopping, she has always used a buggy. We have been a bit ashamed to use the wheelchair, possibly because we still did not want to
accept our daughter's disability. However, 2 months ago, one of Inga's physiotherapists advised us to take the wheelchair for Inga instead of the buggy when we went out of the house. Her small wheelchair has a little handle in the back, which enables someone else to push Inga when she gets tired.
Jesus was preaching the message to them when four men arrived, carrying a . He said to the paralyzed man, "I tell you, get up, pick up your mat, and go home!" The man got up, picked up his mat, and walked out in front of them all. John 5:8. John 6:63.
The rest of the time, Inga can maneuver herself wherever she wants. She enjoys that so much! She can help with the shopping by picking from the shelves the books she wants to buy. She can play easily with other kids and play games on her own. We feel a bit foolish; we had not considered how the buggy restricted her independence. What is most important, however, is that we move forward. Our fight against SMA goes on, and we hope that we are just one step away from the moment when Inga starts treatment. Thank you so much for being with Inga all this time! Yes, in a few years, she will be walking and rid of the wheelchair, thanks to CRISPR. Fig. 2.9 shows Inga's portrait in her wheelchair.
The big surprise came in 2012, when Dr. Jennifer Doudna and Dr. Emmanuelle Charpentier discovered that CRISPR can also be used to edit DNA at any location in a healthy human cell. The splicing enzyme can cut DNA to perform autocorrection, so CRISPR can be used to remove mutations and diseases and to insert healthy DNA into the cell. Just imagine: people with muscular dystrophy may be able to walk in less than a decade. CRISPR-Cas9, in particular, can become an excellent tool for slicing, recombining, and generally editing DNA, as long as it receives the right messages. Cancer, Alzheimer's, and Parkinson's diseases could become history.
FIGURE 2.9 Inga in her wheelchair. She will teach us something about life.
Social and ethical issues in DNA fingerprinting
Research Professor Alec Jeffreys, of the University of Leicester, eloquently stated, "The national DNA database is a very powerful tool in the fight against crime, but recent developments, such as the retention of innocent people's DNA, raises significant ethical and social issues." Professor Jeffreys has expressed his concerns over the national DNA database in the United Kingdom. Previously, the database was used only to store the DNA information of past criminals, to be used as evidence if needed in future cases. Now, the database has evolved. As the technology of DNA fingerprinting has advanced, several social and ethical issues have arisen over the right to possess a subject's DNA. Rightfully so, people are concerned about the consequences they will face if their DNA information becomes publicly accessible. At the moment, laws are in place to protect a subject's rights to their own DNA profile, but these laws do not completely guarantee the protection of that information. Secure storage of medical records, including the results of DNA testing, is a viable way to protect the privacy of patients and the confidentiality of information. We will discuss this topic in the following chapters.
Hundreds of thousands of people who have been linked to crimes but released soon after are also part of the database. Their DNA information is not needed in the database, and it is unethical to keep it there without the written permission of the record owner. Unauthorized people could gain access to this sensitive information and use it to compromise the integrity of the individual. Because DNA technology has improved over the years, it can identify an individual's characteristics easily and accurately; it could suggest the risk of certain cancers, criminal behavior, and other illnesses.
If private characteristic information is obtained by insurance companies or potential employers, it could seriously impact an individual’s insurance rates and job prospects. For example, if someone’s DNA profile displays that he or she is at risk for breast cancer, an insurance company will want to charge that customer much higher rates due to the potential payouts they will have to give if and when that person does develop breast cancer. In another instance, job employers will not want to hire a person who shows potential criminal behavior in his or her DNA profile, because they would not want to risk crime within their company. Overall, if proper methods are not put in place to strictly restrict the access to DNA profiles in proper circumstances, then many human and privacy rights may be put in jeopardy, and the public will face many social issues.
DNA's other dark side, the hacking nightmare
Looking at the bright side, biologists of the near future will figure out how to program viruses and bacteria to deliver custom-made cures that shrink cancerous tumors or reverse the tide of dementia. In the super-scary scenario, however, bioterrorists could engineer deadly superbugs that target humans on a genetic level. A 2012 article in The Atlantic presented a technologically plausible scheme in which the president of the United States is assassinated by a highly contagious cold, designed to target a weak link in his specific genetic code. Chapter 4 is dedicated to the topic of DNA hacking; there we discuss how technology is used on the Dark Web to eclipse the bright side of genetics.
DNA is a warm place for music
To help solve the world's information storage problem, scientists turned to DNA, aiming to prove that no other medium could match its capacity or durability. Recently, two musical performances, Deep Purple's "Smoke on the Water" and Miles Davis's "Tutu," were chosen to become DNA files. Their binary code, the digital language consisting of 1s and 0s, was converted into the genetic bases (A, C, G, and T). Fig. 2.10 offers an interesting visualization of this digital language conversion. The universal nature of DNA means that more than just music can be stored in this manner. Other information that researchers have turned genetic includes a movie, a computer virus, and an entire computer operating system. The density of the medium is such that it could one day hold all the earth's data in a single room. Under the right conditions, the genetic files could last for millennia.
FIGURE 2.10 Once the digital language was converted, the bases were synthetically created and arranged to match the binary sequences of the music. The songs covered 140 MB on a hard drive. But after becoming DNA, they barely made a speck. The files were retrieved by reversing the process, and no segment was corrupted.
DNA can hack computers
What sounds like a far-fetched movie storyline has been accomplished in real life: scientists have hacked a computer using DNA. In 2017, researchers at the University of Washington took malware and encoded it into synthetic DNA bases. The leap from biological to digital happened when a computer sequenced the strand. When the software changed the A, C, G, and T combinations back into computer code, the virus was released and gave the researchers full remote control of the computer. While this brand of hacking is not being used at the moment, it may be only a matter of time. The purpose of the bizarre infection was to highlight a concern that sequencing equipment, especially equipment that runs open-source software, was
vulnerable to such attacks. Since DNA sequencing and genetic databases are highly valuable to many scientific fields, malware delivered in this manner could cause untold damage. DNA hacking is covered in extensive detail in Chapter 12.
Appendices
Glossary (MERIT CyberSecurity engineering archives)
Get down with the lingo
Acetyl: A chemical group of one oxygen atom, two carbon atoms, and three hydrogen atoms that is added to another carbon; the first carbon of the acetyl group has a double bond with oxygen and also binds to a methyl group. The acetyl group is commonly written as -COCH3. Acetyl groups are added to histones to promote gene expression.
Adenine: One of the four bases that compose DNA and RNA. Adenine is a purine derivative, like guanine. The name adenine refers to the fact that it was first isolated from the pancreatic gland (“gland” is aden in Greek) by Albrecht Kossel. Go Albert!
Allele: A gene variant responsible for a specific inheritable trait. (An allele is also the punch line for a bad joke about a guy named Al who owns an eel.)
Amino: A chemical group, composed of one nitrogen atom and two hydrogen atoms, that can be added to a carbon. It is commonly written as -NH2. Oh ... amino.
Anaphase: The stage of mitosis or meiosis where individual chromosomes are first pulled along microtubules toward either pole of the centrosomes. (Anaphase is also the punch line of a joke where a girl grows up thinking her name is Ana.)
Anticodon: The three-base sequence complementary to the codon sequence. Anticodon sequences are found on transfer RNA (tRNA) molecules, which are the links between codon sequences and a given amino acid.
Bacterial artificial chromosome (BAC): A fragment of DNA used to clone or transform bacteria with inserted DNA sequences that range in size from 150 kilobases (kb) to more than 700 kb. BACs are commonly used to sequence large genomes of various organisms. One day, BACs hope to become real chromosomes.
Bacteriophage: A virus that infects bacteria, often shortened to “phage.” That is all we got!
Base: Nucleotide, or base, is the basic unit or building block of DNA and RNA.
Nucleotides include cytosine, thymine, and uracil, which are called pyrimidines, and guanine and adenine, which are called purines. Adenine (abbreviated A) pairs with thymine (T) or uracil (U), and guanine (G) pairs with cytosine (C). A goes with T, and G goes with C. Also, see Nucleic acid. These guys are generally called “bases,” so if you hear the phrase “DNA bases,” it is referring to the nucleotides in DNA. Nucleotides strung together form nucleic acids.
Base-pairing rule (Chargaff’s rule): Named after Austrian chemist Erwin Chargaff, the rule that, for any amount of double-stranded DNA, there should be equivalent amounts of pyrimidines and purines. That is, for however many cytosines and thymines there are (the pyrimidines), there should be as many guanines and adenines (the purines). This rule was refined by Watson-Crick base pairing, which identifies the specific base pairings in DNA: adenine (A) pairs with thymine (T), and guanine (G) pairs with cytosine (C). A goes with T, and G goes with C.
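Chargaff’s equivalence can be illustrated with a short sketch: given one strand, the implied double-stranded duplex must contain equal numbers of A and T, and of G and C, so purines balance pyrimidines (a toy illustration, not a laboratory method).

```python
# Toy check of Chargaff's rule: in double-stranded DNA, adenines equal
# thymines and guanines equal cytosines, so purines (A+G) match
# pyrimidines (T+C) across the duplex.
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def duplex_counts(strand: str) -> Counter:
    """Count bases across both strands of the duplex implied by one strand."""
    other = "".join(COMPLEMENT[b] for b in strand)
    return Counter(strand + other)

counts = duplex_counts("ATTGCCGA")
assert counts["A"] == counts["T"]       # A pairs with T
assert counts["G"] == counts["C"]       # G pairs with C
purines = counts["A"] + counts["G"]
pyrimidines = counts["T"] + counts["C"]
assert purines == pyrimidines           # Chargaff's equivalence
```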
Centrosome: An organelle responsible for organizing the microtubules. Centrosomes are also called the microtubule-organizing centers, particularly during mitosis. Centrosomes move to either end of the cell and pull chromosomes apart so that cell division can occur.
Chromatin fiber: A coil of nucleosomes that is an intermediate level of packaging of DNA and can be condensed into chromosomes. (These fibers are not the ones found in all-bran cereal.)
Chromosome: A single piece of DNA that contains many of the genes, regulatory elements, and nucleotide sequences of an organism. Chromosomes can be either circular, as in bacteria and archaea, or linear, as in eukaryotes.
Clone: A cell or organism that is genetically identical to the source from which it was derived. To clone is to produce an identical copy of something. (Hello, Dolly the sheep!)
Codon: The three-letter sequence that encodes a specific amino acid. A transfer RNA (tRNA) with the amino acid attached binds specifically to this three-letter sequence, and the specificity of the tRNA binding is determined by the codon or, rather, the anticodon sequence.
Complementary base pairs: See the previously defined Base-pairing rule (Chargaff’s rule). The complementary base pairs in DNA are adenine paired with thymine, and guanine paired with cytosine. In RNA, thymine is replaced with uracil.
Cosmid: A type of plasmid that contains cos sequences from lambda (λ), which is a type of phage, and can carry 37-52 kb of DNA, compared with the 1-20 kb of normal plasmids. While normal plasmids are used to transform cells, cosmids transduce cells by using the phage cos sequences to integrate into the bacterial chromosome. The advantage of transduction over transformation is that cosmids can hold larger DNA inserts, due to decreased recombination compared with plasmids. Cosmids are used as cloning vectors and can help build genomic libraries.
Cytokinesis: The process by which the cytoplasm of a cell is partitioned into two daughter cells. Cytokinesis is similar to when they put a divider into your classroom, if your classroom has dividers.
Cytosine: One of the four bases that compose DNA and RNA. Cytosine is a pyrimidine derivative, like thymine and uracil. Mammalian cells can remove an amino group from cytosine to produce uracil, which helps prevent infection from retroviruses. Cytosine was first isolated from a calf thymus by Albrecht Kossel. Yay, Albert!
Demethylation: The removal of a methyl group from a molecule.
Deoxyribose: A derivative of the pentose (five-membered ring) sugar ribose with the formula C5H10O4. Deoxyribose is the sugar that serves as the backbone for DNA and lacks the 2′-hydroxyl group present in ribose.
DNA: Deoxyribonucleic acid. DNA is a macromolecule (“macro” = big), also known as a nucleic acid, which is composed of phosphate groups, deoxyribose sugar groups, and the nucleotides adenine, guanine, cytosine, and thymine. DNA contains the genetic code needed by all cells to produce proteins and other molecules necessary to sustain life. DNA seems to make it into every one of Shmoop’s Biology glossaries.
DNA polymerase: The enzyme that adds deoxyribonucleotides to a strand of DNA. The newly added deoxyribonucleotides are complementary to a template strand of DNA.
Double helix: The common structure of double-stranded molecules of DNA or RNA, although it mostly refers to DNA because RNA is rarely found double-stranded. James Watson and Francis Crick first reported the double helix structure of DNA in 1953 in the journal Nature.
DsDNA: Double-stranded DNA.
Epigenetics: The study of changes in gene expression that are controlled by factors beyond the DNA sequence, such as DNA methylation and histone modifications, like methylation and acetylation.
Euchromatin: The part of chromosomes that is actively transcribing DNA to produce messenger RNA (mRNA) and proteins. Euchromatin is typically characterized by demethylated DNA and acetylated histones that allow access to more transcription factors and RNA polymerase.
Gene: The basic unit of heredity in an organism, associated with the production of one type of RNA or protein that serves some function. In monogenic, or single-gene, traits, one gene is responsible for determining a certain phenotype, whereas polygenic, or multiple-gene, traits require multiple genes to determine the phenotype. (Gene is also a great name for a boy who wants to be a movie critic.)
Gene expression: The way in which genes are used to synthesize a product that has a specific function or purpose. Gene expression often produces proteins. Gene expression is controlled by factors beyond the DNA sequence, such as DNA methylation and histone modifications, like methylation and acetylation.
Gene open reading frame: A section of DNA that does not contain a stop codon within the reading frame. Also abbreviated ORF.
Genetic code: The information present in a DNA sequence that is translated into a protein. The genetic code is broken into three-nucleotide codons, where each codon encodes a single amino acid. As there are only 20 standard amino acids, many codons encode the same amino acid. Three codons serve as “stop codons,” signaling the ribosome to stop synthesizing the protein.
Genome: All the genes and gene-related sequences that define an organism. Eukaryotes have much larger genomes than bacteria, archaea, and viruses. Genome size is likely a function of increased cell size, cell division rate, metabolic rate, developmental rate, and complexity of an organism.
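The codon logic in the Genetic code entry can be sketched with a deliberately tiny codon table; the three stop codons (UAA, UAG, UGA) are the real ones, but only four of the 64 amino-acid codons are included here for illustration.

```python
# Minimal sketch of reading codons from an mRNA sequence, three bases at
# a time, until a stop codon. A real table maps all 64 codons.
STOP_CODONS = {"UAA", "UAG", "UGA"}
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual start codon
    "UUU": "Phe",  # phenylalanine
    "GGC": "Gly",  # glycine
    "AAA": "Lys",  # lysine
}

def translate(mrna: str) -> list[str]:
    """Read codons left to right until a stop codon (or the end)."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        peptide.append(CODON_TABLE.get(codon, "???"))
    return peptide

assert translate("AUGUUUGGCUAAAAA") == ["Met", "Phe", "Gly"]
```

Note how the trailing AAA after the UAA stop codon is never read, mirroring how the ribosome releases the transcript at the stop signal.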
Guanine: One of the four bases that compose DNA and RNA. Guanine is a purine derivative, like adenine. The name guanine originates from the fact that it was first isolated from seabird guano. (It is a little classier than “poopine.”)
Helicase: An enzyme (“-ase” gave that one away) that is responsible for unwinding DNA’s double helix so that DNA replication as well as RNA synthesis can occur. Good on ya, Helicase!
Helix: The curved shape that many biological molecules adopt to perform their functions. A DNA molecule is formed from two intertwined helices.
Heredity: The process by which traits are passed from mother and father to baby. Also, “inheritable” and “heritable” mean exactly the same thing.
Heterochromatin: The opposite of euchromatin, in that it is DNA with little gene expression due to either DNA or histone methylation. Heterochromatin regions have tightly packed DNA, which is why little transcription occurs there.
Histone: A positively charged protein that interacts with negatively charged DNA to help in the packing of DNA into chromosomes. Histones play an important role in gene expression; modifying histones by adding or removing methyl, acetyl, or other groups affects how tightly histones bind DNA, which controls gene expression.
Hydroxyl: A chemical group of oxygen and hydrogen that is added to a carbon. It is commonly written as -OH. Oh, snap!
Hydrolysis: A reaction that splits a water molecule into its H+ and OH− ions and, in the process, breaks a polymer by adding an ion to each fragment. This chemical reaction effectively makes two polymer fragments, an acid and a base. Hydrolysis can occur in a sequence-specific manner (see Restriction enzyme) or nonspecifically.
Messenger RNA (mRNA): The molecule of RNA that contains the blueprint of a protein that will be synthesized.
Metaphase: The stage of mitosis where all the chromosomes line up in the middle of the cell along microtubules to prepare to be separated during anaphase.
Methyl: A chemical group of one carbon and three hydrogen atoms that is added to another carbon. It is commonly written as -CH3. Methyl groups are treated a lot like currency; they are often added or removed according to a molecule’s needs. It is not clear whether everyone is fighting to have them or fighting to get rid of them. Poor little guys.
Methylation: The process of adding a methyl group to a carbon atom. Methylation is commonly associated with the silencing of gene expression by DNA methylation. Histone methylation also occurs, although it can either activate or silence gene expression, depending on where the histone is methylated.
Microtubule: One of the three components of the cytoskeleton, along with actin and intermediate filaments. Microtubules are polymers of molecules called tubulin. They are instrumental in organizing the process of mitosis and ensuring that each daughter cell has the right chromosomes. Microtubules primarily function by forming a path for each chromosome to travel to the appropriate daughter cell.
Mitosis: The process of cell division where a cell copies all of its genetic information and divides into two identical daughter cells. All cellular contents are duplicated and shared between the two daughter cells. Thanks, Pops!
Mutation: A change in the genetic sequence, which may or may not be harmful to the organism. Mutations in many cases drive evolution, where mutants that are better suited for survival outcompete those that lack that particular mutation. However, some mutations may be lethal or deleterious, where mutants are weaker than the wild-type organism, aka the organism without the mutation.
Typical mutations include missense mutations (replacing one nucleotide with another), deletions, and insertions.
Nuclease: An enzyme (there is that “-ase” again) that is responsible for breaking down polymers of DNA or RNA. Nucleases break down DNA or RNA polymers by hydrolyzing the 5′-3′ phosphodiester bonds (the unit described by two covalent ester bonds between a phosphate group and two pentose sugars) between nucleotides. Hydrolysis is a reaction that splits a water molecule into its H+ and OH− ions and, in the process, breaks a polymer by adding an ion to each fragment. This chemical reaction effectively makes two polymer fragments, an acid and a base. Hydrolysis can occur in a sequence-specific manner (see Restriction enzyme) or nonspecifically.
Nucleic acid: A macromolecule (read: big) composed of a pentose (five-membered ring) sugar, like ribose or deoxyribose, a phosphate group, and a base: adenine, guanine, cytosine, thymine, or uracil. Nucleic acids are called “acids” due to the negatively charged phosphate groups. DNA replication is the process of polymerizing nucleic acids.
Nucleotide: A nucleotide is the basic unit or building block of DNA and RNA. Nucleotides include cytosine, thymine, and uracil, which are called pyrimidines, and guanine and adenine, which are called purines. Adenine (abbreviated A) pairs with thymine (T) or uracil (U), and guanine (G) pairs with cytosine (C). A goes with T, and G goes with C. Also, see Nucleic acid. These guys are generally called “bases,” so if you hear the phrase “DNA bases,” it is referring to the nucleotides in DNA. Nucleotides strung together form nucleic acids.
Nucleus: Called the “brain” of the cell, it stores all DNA necessary to make RNA and protein, with the exception of mitochondrial DNA, which is in the ... mitochondria. The nucleus is encased in two lipid bilayers, called the nuclear membrane, that serve to house the DNA in the nucleus and restrict the
flow of cytoplasmic components into the nucleus. Basically, the nuclear membrane keeps everyone in their places. Nuclei also hate being pronounced “nuculei,” so do not do that. At least not to their faces.
Nucleolus: A structure within the nucleus that is not membrane-bound but is instead a protein- and nucleotide-dense region of the nucleus where transcription of ribosomal RNA (rRNA) occurs. The nucleolus can be easily observed using light microscopy (light plus microscopes).
Nucleosome: The basic unit of DNA packaging, where DNA loops twice around eight histone proteins, forming a nucleosome core. Like a rollercoaster, except not at all.
Phylogenetics or phylogeny: The area of biology that studies the relationships of various organisms to each other, as described by either genetic information or morphological (shape) relationships.
Plasmid: A DNA molecule commonly found in bacteria that is independent from the chromosome. A plasmid can replicate independently from chromosomal DNA and often encodes pathogenic factors (that cause disease) as well as antibiotic resistance genes (that resist antibiotics). Unlike chromosomal DNA, plasmid DNA can easily be transferred from one bacterium to another, which makes it superhandy for DNA technology applications.
Ploidy: The number n of sets of chromosomes in a given cell. Hello, Algebra; nice to see you. A haploid cell has n chromosomes, and a diploid cell has 2n chromosomes. Ploidy, ploidy, ploidy.
Polyadenylation: The step in the process of RNA transcription where messenger RNA (mRNA) transcripts are tagged with a series of adenine nucleotides to aid in the export of the mRNA and maintain its stability during translation.
Polymer: A longish molecule made of repeating smaller molecules.
Primer: A short oligonucleotide sequence that serves as the starting point for DNA polymerization. For DNA replication, the primer is often an RNA sequence that is later degraded and replaced with a DNA sequence.
Promoter: A regulatory region that is often upstream of the start site of a gene open reading frame. Promoters activate gene expression by recruiting transcription factors, which in turn recruit RNA polymerase (an enzyme) to the site to be transcribed.
Prophase: The stage in which the chromatin fibers condense into distinct chromosomal bodies that are visible under a microscope. The nucleus also begins to break down in this stage, although most of the nuclear breakdown occurs in late prophase/early metaphase, in a stage that some refer to as prometaphase.
Purine: A nitrogenous base composed of a pyrimidine ring fused to an imidazole ring (a double ring). Purines such as adenine and guanine serve as bases for DNA or RNA, although other purines such as caffeine, hypoxanthine, and theobromine also exist.
Pyrimidine: A nitrogenous base similar to benzene (a six-membered ring); cytosine, thymine, and uracil are the pyrimidine bases used in DNA or RNA.
Recombination: The process of joining one molecule of DNA to another molecule. Recombination can occur through the exchange of similar sequences, called homologous recombination; the joining of the ends of unrelated sequences, called nonhomologous end joining; or the falling off of a polymerase from one template strand onto another, creating a recombinant sequence. DNA that has been recombined is called recombinant DNA.
Replication: Specifically, the process of taking double-stranded DNA and making two identical copies of that double-stranded DNA. Each strand serves as a template for daughter strand synthesis.
Replication fork: The structure that forms when DNA is replicating. When the DNA is unwound by helicases for DNA replication, one strand, called the leading strand, is synthesized while the DNA
unwinds, whereas short sequences are filled in on the other strand, called the lagging strand, to make two daughter double helices. The junction where the DNA is split and the leading and lagging strands are being replicated is called the replication fork.
Repressor: A protein that binds to a DNA sequence upstream of a gene, or somewhere at the beginning of a gene, and stops gene expression. A repressor should not be confused with a depressor or the person who kills all the fun at a party.
Restriction enzyme: A specific nuclease that cuts double-stranded DNA at a specific sequence. Restriction enzymes were first discovered in bacteria, which use these enzymes as a defense mechanism against foreign DNA.
Retrovirus: A virus that has an RNA genome and uses the enzyme reverse transcriptase to generate a double-stranded DNA genome. Retroviruses are typically used for gene therapy or other biotechnology applications. FYI: They are not useful for 70s theme parties.
Reverse transcriptase: A DNA polymerase (enzyme) that copies from RNA templates. Reverse transcriptase is sloppy and error prone, due to a lack of proofreading. It also does not tuck in its shirt.
Reverse complement: A strand of DNA that is in the opposite orientation and is complementary to the first strand. It is not a compliment that is really an insult. In double-stranded DNA, one strand runs in the 5′ to 3′ direction, whereas the other runs in the 3′ to 5′ direction, as you look at it from left to right, or top to bottom. Therefore, with respect to either strand, the other strand is the reverse complement.
Ribonucleic acid: A series of nucleotides with ribose (a five-membered ring) as the backbone sugar. Ribonucleic acid often codes for proteins, like messenger RNA (mRNA); may be functionally active, like ribozymes, including ribosomal RNA (rRNA) molecules; or may function in translation, like transfer RNA (tRNA), by serving as the link between codons and amino acids.
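The Reverse complement entry above translates naturally into a few lines of code (an illustrative sketch): complement every base, then reverse the strand so it reads 5′ to 3′.

```python
# Reverse complement: complement each base (A<->T, G<->C), then reverse
# the strand so it reads in the 5' to 3' direction.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(strand))

assert reverse_complement("ATGC") == "GCAT"
# Applying it twice returns the original strand, as the entry implies:
# each strand is the reverse complement of the other.
assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"
```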
Ribose: A pentose (five-membered ring) sugar with the formula C5H10O5. Ribose is the sugar that serves as the backbone for RNA.
Ribosomal RNA: The RNA in the ribosome that helps decode mRNA into amino acids.
Ribosome: A complex of RNA and protein that converts a messenger RNA (mRNA) sequence into a series of amino acids. It serves as the protein factory of the cell.
Ribozyme: If you read this word and thought to yourself, “Sounds like enzyme!” you are pretty close. An RNA molecule that catalyzes chemical reactions. “Ribozyme” comes from ribonucleic acid enzyme.
RNA: Ribonucleic acid. RNA is a macromolecule (again, think “big”) composed of phosphate groups (aka -H2PO4R, where R is a functional group), ribose sugars, and the nucleotides adenine, guanine, cytosine, and uracil. It functions as a go-between for DNA and proteins as messenger RNA, or mRNA; as a key player in protein synthesis as ribosomal RNA, or rRNA; as a bridge as transfer RNA, or tRNA; and as a gene regulator as small interfering RNA, or siRNA, and microRNA, or miRNA. Why are there so many different kinds of RNA? We were wondering the same thing. Do not worry about all the different types for now.
SsDNA: Single-stranded DNA.
Supercoil: An overwound circular piece of DNA. It has no superpowers, beyond being really twisted.
Stop codon: A three-base sequence that tells the ribosome to stop adding amino acids to the growing peptide chain. Stop already, please! The stop codon causes the ribosome to release the mRNA and the protein.
Telomerase: An enzyme that, during each round of DNA replication, replaces the lost DNA sequence at the chromosome ends with a repeat sequence. Loss of telomerase activity generally leads to cell death. (Sad face.)
Telomeres: The ends of DNA in vertebrates. Telomeres are highly repetitive sequences, partially due to the activity of telomerase, an enzyme (there is that “-ase” again). Shortening of telomere length is important for controlling the age of a cell. Cancer cells and other diseased cells have abnormal telomeres.
Telophase: The antithesis of prophase, where daughter chromatids begin to decondense as the nucleus reforms for each future daughter cell. These are not unique cells yet, as this phase is followed by cytokinesis, where the two cells are finally separated.
Termination: The completion of a metabolic process, such as DNA replication, RNA transcription, or translation. There are different cues for each: DNA replication terminates when replication forks run into each other or reach the end of the DNA; RNA transcription terminates with either the formation of a 3′ hairpin loop or the arrival at a polyadenylation signal; and translation terminates with the stop codon.
Thymine: One of the four bases that compose DNA. Thymine is replaced with uracil, a variant of thymine without a methyl group, in RNA. Thymine is a pyrimidine derivative, like cytosine and uracil. Thymine was first isolated from a calf thymus by Albrecht Kossel.
Topoisomerase: An enzyme responsible for unwinding DNA, generally by cutting one strand of DNA to relieve the torsional force that causes supercoiling. Topoisomerases relax the DNA structure so that helicases can enter and unwind the helix.
Transduction: The insertion of foreign DNA into a bacterial cell by means of a bacteriophage. The bacteriophage injects DNA into the cell, along with factors that help the integration of the DNA into the bacterial chromosome.
Transfer RNA (tRNA): The molecule of RNA that helps translate the three-letter genetic code of messenger RNA (mRNA) into the 20-letter code of amino acids.
Transformation: The process by which bacteria nonspecifically take up genetic material.
Transformation has been used by biotechnologists to amplify plasmids and to express various proteins. The usefulness of transformation can be seen in the “Spiderman and Other Examples of Recombinant DNA” subsection of “In the Real World.”
Transcription: The process of generating an RNA copy of a gene, which is initiated by transcription factors binding to double-stranded DNA at a promoter sequence and then recruiting RNA polymerase to transcribe the gene.
Transcription factor: A protein that binds DNA and activates transcription.
Translation: The synthesis of proteins from an mRNA template. Each codon sequence on the mRNA copy of the gene encodes a specific amino acid that is recognized by the anticodon sequence of a tRNA carrying that amino acid. Translation is mediated by ribosomes and terminates when a stop codon is reached on the mRNA template.
Uracil: One of the four bases that compose RNA. Uracil replaces thymine in RNA and is also a pyrimidine derivative, like cytosine and thymine. Uracil was first isolated in 1900 from yeast hydrolysis.
Vector: A plasmid that is used specifically for genetic engineering and biotechnology purposes. Vectors are constructed for specific purposes and typically have an antibiotic resistance gene as well as a multiple cloning site, or a site with various restriction endonuclease recognition sequences.
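The Transcription entry can be reduced to a toy base-substitution sketch; it ignores promoters, transcription factors, and RNA polymerase entirely and only shows how an mRNA copy relates to a DNA template strand (uracil standing in for thymine).

```python
# Toy transcription: build the mRNA complementary to a DNA template
# strand, substituting uracil (U) for thymine (T). Reading the template
# 3'->5' yields the mRNA 5'->3'; none of the enzymatic machinery is modeled.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template: str) -> str:
    """Return the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[b] for b in template)

# The template TAC GGC yields the mRNA AUG CCG (note the AUG start codon).
assert transcribe("TACGGC") == "AUGCCG"
```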
References
Amazing facts about human DNA: http://scienceandnature.org/Articles/Human_DNA.pdf; https://www.hackread.com/microsoft-planning-to-use-dna-for-data-storage/.
DNA computing: https://pt.slideshare.net/sajan45/dna-computing-44842686/5; https://www.microbe.net/simpleguides/fact-sheet-dna-rna-protein/; https://science.howstuffworks.com/life/cellular-microscopic/dna3.htm.
DNA revolution: https://www.nationalgeographic.com/magazine/2016/08/dna-crispr-gene-editing-science-ethics/.
Genes and chromosomes: https://www.merckmanuals.com/home/fundamentals/genetics/genes-and-chromosomes.
Gene editing: www.wired.com/story/what-is-crispr-gene-editing/.
Hacking: https://www.nature.com/articles/d41586-018-05769-8.
Hacking the U.S. president: https://www.theatlantic.com/magazine/archive/2012/11/hacking-the-presidents-dna/309147/?single_page=true.
https://www.linkedin.com/pulse/dna-day-amazing-things-you-did-know-your-anu-acharya/.
https://www.quora.com/How-does-DNA-hold-genetic-information-Are-A-T-C-and-G-the-genetic-information.
https://news.nationalgeographic.com/news/2010/02/100216-king-tut-malaria-bones-inbred-Tutankhamun/.
IEEE article: https://spectrum.ieee.org/semiconductors/devices/exabytes-in-a-test-tube-the-case-for-dna-data-storage.
Music: https://phys.org/news/2017-09-items-music-anthology-eternity-dna.html.
Genetic engineering: https://www.yourgenome.org/facts/what-is-genetic-engineering.
Central dogma: https://science-explained.com/theory/dna-rna-and-protein/.
Music: http://listverse.com/2017/10/28/10-incredible-things-scientists-did-with-dna-for-the-first-time/.
National public radio on DNA: https://www.npr.org/news/specials/dnaanniversary/.
Statement: https://stevemorse.org/genetealogy/dna.htm; https://www.yourgenome.org/facts/what-is-crispr-cas9.
The future of data storage: https://thehackernews.com/2017/03/dna-data-storage.html.
What is CRISPR: www.yourgenome.org/facts/what-is-crispr-cas9; https://www.digitaltrends.com/cool-tech/what-is-crispr-a-beginners-guide/.
CHAPTER 3
The miraculous anatomy of the digital immunity ecosystem
Creativity is a wild mind and a disciplined eye. – Dorothy Parker
I don’t have a clue; to know what you’re going to draw, you have to begin drawing. – Picasso, when asked where his ideas came from
One of the brain’s greatest feats is its ability to adapt when challenged. Exercising the brain can even change its architecture. Researchers found that gray matter in the posterior hippocampus, a brain area linked to memory, increased in London cabbies after intensive study of the city’s labyrinthine streets. – National Geographic issue, “Science of Genius”
Introduction
We all remember the famous business metric, “If you can measure it, you can control it.” Well, the Internet has become a mega gravitational force ushering in a new era of prosperity and cybercrime. We are riding the Internet whether we like it or not. Cybercrime and terrorism have been making drastic black holes in our societal fabric and thriving on profound immorality and political poisoning. The World Trade Center disaster was, above all, a failure of imagination, as Thomas Kean, chairman of the 9/11 Commission, lamented in the Commission’s report. This chapter focuses on two pivotal topics. First, we describe the anatomical structure of the digital immunity ecosystem (DIE), which creates a “replica” of the human immune system as an intelligent digital immunity system, with smart components designed with artificial intelligence (AI) and nanotechnology. Second, we discuss the rational justification for using AI and nanotechnology to build a distinctive and unique defense for smart cities. In sum, we will show how AI will interface with the Cognitive Early Warning Predictive System (CEWPS) and upload human intelligence, which will augment its defense capabilities by a million-fold. The November 2014 issue of Forbes magazine carried an article with a big title: “America’s Critical Infrastructure Is Vulnerable to Cyber Attacks.” It is a wakeup-call article, because the whole country is on thin ice with no real defense system in place to protect our critical systems. There is a lot of static noise and no kinetic movement forward. According to the Association of American Publishers, between 2001 and 2016 over 1,500 books were published on cybersecurity, most of them rehashing the same technology of the antivirus industry.
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00003-3 Copyright © 2020 Elsevier Inc. All rights reserved.
Artificial intelligence (AI) and nanotechnology (NanoTech)
We highlighted the amazing “soft cyborg” technology, a kind of ugly and mean AI-centric nanobot, and how it may go berserk and spew unlimited quantities of “gray goo” that will blanket the Earth, smothering all life; in other words, it is our final battle of Armageddon. On May 11, 2017, President Trump issued an executive order on strengthening the cybersecurity of federal networks and critical infrastructure. In the executive order, he highlighted: “Within 90 days of the date of this order, the Secretary of State, the Secretary of the Treasury, the Secretary of Defense, the Attorney General, the Secretary of Commerce, the Secretary of Homeland Security, and the United States Trade Representative, in coordination with the Director of National Intelligence, shall jointly submit a report to the President, through the Assistant to the President for National Security Affairs and the Assistant to the President for Homeland Security and Counterterrorism, on the Nation’s strategic options for deterring adversaries and better protecting the American people from cyber threats.” On January 13, 2015, President Obama gave a speech at the National Cybersecurity and Communications Integration Center in Arlington, VA. He ended his speech by saying: “And as long as I’m President, protecting America’s digital infrastructure is going to remain a top national security priority.” Both presidents voiced the same political rhetoric, which made our adversaries more active and aggressive in their attempts to attack the United States. Poking a hornet’s nest is not a constructive act. As necessity is the mother of invention, and to control global malware that spreads like cancer, we need a radical change of direction, mindset, and tools to revolutionize cyber defense. Nanomedicine is catching up with cancer. Other medical nanorobots can serve as cleaners, removing unwanted debris and chemicals (such as prions, malformed proteins, and protofibrils) from individual human cells.
In the domain of engineering, nanomaterials can be used to create highly sensitive sensors capable of detecting hazardous materials in the air. For example, carbon-based nanotubes are relatively inexpensive and consume minimal power. Other areas of nanotechnology pertinent to homeland security are emergency responder devices. Lightweight communications systems that require almost no power and have a large contact radius would give rescuers more flexibility. Nanotech robots could be used to disarm bombs and save trapped victims, reducing the risks to rescue workers. However, the most exciting application of nanotechnology has been in the field of cybersecurity, with the design of digital immunity and the Smart Vaccine (SV) through the creation of nanobots.
FIGURE 3.1 By leveraging medical knowledge that leads us to immunity vaccines, we can take a similar approach to launch a new chapter in enterprise systems immunity. The SV is the replication of the human immune system. The idea is fascinating and straightforward. We build a “defense by offense” mechanism to immunize any critical infrastructure or mission-critical enterprise system. This is the only way to outsmart cyber malware.
Chapter 3 The miraculous anatomy
What is the Smart Vaccine?
Nanobot engineering is defined as the engineering of "smart" functional systems at the molecular scale. The Nanotechnology dictionary, www.nanodic.com, defines nanotechnology as "the design, characterization, production and application of structures, devices and systems by controlled manipulation of size and shape at the nanometer scale (atomic, molecular and macromolecular scale) that produces structures, devices and systems with at least one novel/superior characteristic or property." The SV is an intelligent nanobot, programmable to do a specific function in digital immunity. Fig. 3.1 gives an exploded view of the SV. As it is scaled at the molecular level, it can be a component of a chip, it can roam through computer cables, it can reside on a storage device, and it can communicate with other SVs. That is why it is called the "Smart" Vaccine. It is the counterpart of the B cells or T cells in the human immune system. We can also compare it to the neurons that transfer sensory and motor signals to and from the brain. The scaling down of CEWPS/SV to the nanomolecular level is the advantage that places the system ahead of its time. In this and the following chapter, we describe in more detail the anatomy of the SV nanobot.
Smart cities are like the human body
The closest analogy to the smart city (SC) ecosystem is the human body, which is holistically administered by the brain and 12 federated systems that perform autonomically with perfect orchestration. No waterway on Earth is so complete, so commodious, or so populous as the wonderful river of life, the "stream of blood." The violin, trumpet, harp, grand organ, and all the other musical instruments are mere counterfeits of the human voice. Fig. 3.2 gives a unique assembly of the human organs as an isometric engineering drawing. Another marvel of the human body is the autonomic self-regulating process of ventilation, by which nature keeps the body temperature in health at 98°F. Whether in India, with the temperature at 130 degrees, or in the arctic regions, where the records show 120 degrees below freezing point, the body temperature remains the same, despite the extremes to which it is subjected. It was said that "all roads lead to Rome!" Modern science has discovered that all roads of real knowledge lead to the human body. Moreover, you think cities are crowded now? For the first time in history, more than 50% of the world's population lives in cities. By 2030, more than 5 billion people will live in urban settings. However, before we get to that kind of population density, we have to optimize our cities. We need to make them smarter, safer, and above all more secure. Yes, technology can help. Modern cities compete with one another to attract businesses, talent, skills, and taxpayers. As a result, city administrations are becoming entrepreneurial, valuing innovation, technology, marketing, and communication. The smart city ecosystem is a broad partnership between the public and private sectors.
City planners and developers, nongovernmental organizations, IT system integrators, software vendors, energy and utility providers, the automotive industry, and facility control providers, as well as technology providers for mobile technology, cloud computing, networking, system to system (S2S), and radiofrequency identification (RFID), all have a role to play. In addition, like the human body, component connectivity is one of the principal prerequisites for SC design. Smart cities live by their smart grids, which are the nervous system that allows cities to live
FIGURE 3.2 The closest analogy of the Smart City ecosystem is the human body that is holistically administered by the brain and 12 federated systems that perform autonomically with perfect orchestration.
and breathe, to facilitate the exchange of information, and to respond promptly to danger (e.g., in the case of a massive cyberattack on the city, which is highly probable). The digital immunity smart grid (DISG) will be the best savior. It is where all the avant-garde technologies converge (AI, nanotechnology, cybernetics, and human intelligence), bringing the new paradigm shift in cybersecurity. With all the serial assaults of cybercrime, technology alone has not been able to win the battle or the war. The integration of human intelligence with technology will create an ultraintelligent digital immunity environment that will surpass all the intellectual activities of city attackers. The CEWPS, which we also refer to as the DIE, provides all the intelligent nanocomponents that make up the city's digital immunity smart grid. CEWPS is a thinking machine that reasons with its deep store of human biological knowledge written into its DNA. CEWPS is designed with human intelligence combined with technology intelligence. Dr. Eric Drexler from MIT described the intelligence fusion concept: "As I discuss in Engines of Creation, if you can build genuine AI, there are reasons to believe that you can build things like neurons that are a million times faster. That leads to the conclusion that you can make systems that think a million times faster than a person." CEWPS is like a bullet train compared to the steam locomotive of today's antivirus technology (AVT). Just as the brain directs the central nervous system, including the spinal cord, CEWPS is the central intelligent system that controls the activities of the city's DISG, which spreads across the city as a digital grid protecting all the critical infrastructure systems and the sensor devices of the Internet of
Things (IoT), as shown in Fig. 3.3. The DISG is the combination of the nervous and immune systems together and has the following autonomic responsibilities:
• Send and receive real-time alerts
• Send status information to central command
• Receive action information from central command
• Initiate defense by offense
• Maintain regular vaccination (health) of all smart city systems
• Collect and document attack information
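The autonomic duties listed above can be sketched as a message loop on a grid node. This is a minimal illustrative sketch only; the class name, method names, and message formats are all hypothetical, not part of any real CEWPS implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GridNode:
    """Hypothetical sketch of a DISG node's autonomic responsibilities."""
    name: str
    alerts: List[str] = field(default_factory=list)  # actions received from central command
    log: List[str] = field(default_factory=list)     # documented attack/defense events

    def send_alert(self, message: str) -> str:
        # Send a real-time alert toward the central command center (CCC).
        self.log.append(f"ALERT:{message}")
        return f"{self.name} -> CCC: {message}"

    def receive_action(self, action: str) -> None:
        # Receive action information from central command.
        self.alerts.append(action)

    def vaccinate(self, system: str) -> str:
        # Maintain regular vaccination (health) of a smart city system.
        self.log.append(f"VACCINATED:{system}")
        return f"{system} immunized"

node = GridNode("sector-7")
print(node.send_alert("anomalous traffic on power grid"))
node.receive_action("initiate defense by offense")
print(node.vaccinate("water-treatment SCADA"))
```

In a real deployment these calls would ride on the smart grid's wireless transport; here they simply log and return strings to make the division of responsibilities concrete.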
FIGURE 3.3 This is the top view of the smart city under a cyberattack. All critical systems physically and logically reside on the city smart nanogrid and are connected with adapters to the SV nanogrid (outer grid). When the attack profile is recognized and located on the grid, a convoy of SV fighters rushes to the battlefield and the vaccination process is instantiated. The attacking virus is captured and its attack vector is analyzed; then it is destroyed before it mutates and multiplies. The SV comes in two types: one captures the virus and quarantines it; the second digests it and keeps the remains for the next battle. Once the battle is documented, it is added to the attack/virus knowledge base (VKB).
What is a smart city?
By definition, a smart city is a complex polynomial that includes smart government, smart citizens, smart technologies, and more importantly, smart futures. Predicting the future has been a human endeavor since the beginning of time. History tells us interesting stories about how the future was predicted. Romans, Egyptians, and Greeks had high priests who influenced the rulers with predictions about calamities, sickness, and wars. Temples were holy places where mysterious rituals were performed by priests, including sorcery, exorcism, and astrology. Today, we consider all these rituals nonsense without a true scientific basis. Keeping the city thriving on quality living is critical. Transportation strategies have an impact on public safety, the environment, energy, rapid response services, and the ability to do business. Critical deliveries must fit precisely into this gigantic jigsaw puzzle, all while keeping up the general quality of life. Real-time traffic flow information, coupled with telco, global positioning system (GPS), machine-to-machine communication, Wi-Fi, and RFID technologies, as well as data analytics and prediction techniques, can all be used to enhance private and public travel. Smart cities live by their sensors, which collect information about traffic conditions at critical city spots and send the data via wireless or GPS communication to centralized control systems. This data can, for example, be used to optimize traffic light synchronization.
CEWPS is the intelligent smart shield of the smart city
We would like to introduce basic definitions of some important terms. The CEWPS is the system (hardware and software). Digital immunity (DI) is the "flesh and nerves" of CEWPS. Fig. 3.4 shows the major layers that pathologically communicate with each other. The SV is like the human fighting B cell: it is the software nanobot that eliminates adversary nano cyborgs. CEWPS is the convergence of three futuristic technologies that will establish its leadership for the next two decades. The first technology is human intelligence, where the combination of human-level intelligence with a computer's inherent superiority in speed, accuracy, and memory-sharing ability will be formidable. The second technology is AI, which allows CEWPS to become a digital mind with greatly superior serial power that can run on a faster timescale than we do and can reason and make decisions on its own, imitating human knowledge and experience. The third technology is nanotechnology (NT), which is the most successful technology to date in the world of scientific industry. Nanotechnology empowered CEWPS with incredible features that are not present in any security system today. Nanotechnology introduced the concept of miniaturization, which allowed CEWPS to build software robots (nanobots) at a molecular scale. By the 2020s, nanobot technology will be viable in autonomic computing and SV applications. Nanobots are robots the size of human blood cells (7–8 μm) or even smaller; 44 billion nanobots can travel through smart grids, carrying vaccination services and using high-speed wireless communication to interface with other nanobots and smart city central command computers.
FIGURE 3.4 The architectural representation of CEWPS, the thinking machine of digital immunity, which immunizes all the smart city devices. CEWPS is the composite of three enabling technologies: human intelligence, which provides experience and cognitive power; AI, which expands the intelligence of CEWPS and provides microscopic SVs (nanobots); and nanotechnology, which brings miniaturization and higher performance, spreading digital immunity a million times faster than any software program and intercepting any cyberattack before it reaches the smart city.
The 3D nanoattack scenario
Fig. 3.5 represents a 3D futuristic scenario of nano defense of the smart city. CEWPS, which is an AI-centric intelligent and predictive system, will create the DI defense with the help of the Smart Vaccine grid (SVG). CEWPS is designed to integrate cloud computing, encryption technology, wireless computing, and nanobot logistics. CEWPS operates everything in real-time mode from its command and control center (CCC), in coordination with the city coordination center. CEWPS has a whole menu of crucial responsibilities: not only to destroy massive incoming attacks but also to preserve a copy of the payload and its structure in the VKB for future attacks. More importantly, CEWPS will keep all the smart city systems routinely immunized. We will discuss how the SV commander will command the army of vaccinators (nanobots) to defeat and eradicate all attacks.
FIGURE 3.5 The 3D defense by offense scenario of the CEWPS system. The nanobot army is going after the DDoS; the nanobot vaccinators are immunizing the city’s smart grid. The battle episodes are documented and stored in the battle knowledge engines.
One of the most advanced components of CEWPS is its cognitive intelligent early warning predictive (EWP) system. It will revolutionize the preemption of asymmetric and stealth attacks. The system is equipped with smart sensors on microchips that will communicate with other friendly clouds and satellites and pass this information to the central component of the predictive system for analysis and accurate prediction. Telecom will be more reliable and faster, with wider bandwidth. The fifth generation of wireless will evolve, and nanotechnology will create newer generations. We will discuss the internals of CEWPS in the next section.
Anatomy of CEWPS and its intelligent components
CEWPS comes with an arsenal of offensive and defensive systems. It is designed specifically for use in a defense-by-offense strategy to protect the smart city and its large metropolitan area. CEWPS is, in fact, an incredible "thinking machine" that controls a multitude of autonomic subsystems connected through real-time sensors. This book often presents CEWPS in different configurations for simplicity. The CEWPS architecture has nine subsystems orchestrated by the CCC, which acts as the nervous system's dashboard center. Fig. 3.6 shows how the nine smart unsupervised learning components collaborate while operating autonomically together.
FIGURE 3.6 The pathological integration of nine operating intelligent components is critical. The subsystems receive commands from the CCC. During an attack, all the nanoarmies of SVs (B-cells) are mobilized and fight together. The main focus is to protect the critical systems from the attack by vaccination and, at the same time, capture the virus, determine its identity, and save it for the next attack.
Anatomical composition of CEWPS (the digital immunity ecosystem)
CEWPS component 1: The Central Coordination Center
The Central Coordination Center is the nerve center, the brain, of CEWPS. The center has several technical responsibilities that have to be met for it to guarantee adequate protection and safety of the critical systems. It is aware of all the security activities in the yard. It receives status data from all the subordinate systems and dispatches instructions and performatives based on the situation. The DIE comes with a multiscreen information dashboard, as shown in Fig. 3.7, and a sample real-time performance data screen in Fig. 3.8. CEWPS draws its predictive clout from its cumulative knowledge reservoir. CEWPS collaborates with a group of prominent AVT providers who supply data on the latest cyberattacks in the world. Building a synergistic alliance with AVT providers gives CEWPS higher credibility and, more importantly, feeds CEWPS with a steady, fresh flow of malware data ready to be converted into vaccine knowledge.
FIGURE 3.7 The isometric projection of the CEWPS/DIE technology platform shows all the enabling technologies that are integrated in synch to produce a compelling digital immunity ecosystem, including DNA storage.
CEWPS component 2: The knowledge acquisition component
Here, wisdom, experience, and knowledge are combined. By analogy, we can compare the acquisition process to our sensory nervous system, which is made of cells that collect external sensory signals for distillation, filtration, and conversion into stimuli stored in the brain, before motor signals are instantiated and sent to the attack area. Here are some of the basic terms that describe crucial "associative responses" of the system:
What is experience?
Experience is a conscious event that creates a brain process and gets stored internally in the sensory cortex of the brain as neural code. Once an experience episode gets into the brain, a memory record is created and is ready to be stored. There are three stages of memory. The first is the sensory stage, at the front end: the registration of information during perception occurs in this brief sensory stage, which usually lasts only a fraction of a second. It is your sensory memory that allows a perception, such as a visual pattern, a sound, or a touch, to linger for a brief moment after the stimulation is over.
FIGURE 3.8 There are 20 dashboard screens, depending on the situation in the smart city and the nature of the attack. Screens flash out automatically depending on the defense status of the smart vaccine grid.
Second, after that first flicker, the sensation is stored in short-term memory, which has limited capacity; it can hold about seven items for no more than 20 or 30 seconds at a time. Third, long-term memory can store unlimited amounts of information indefinitely. People tend to more easily store material on subjects that they already know something about, as the information has more meaning to them and can be mentally connected to related information already stored in their long-term memory. That is why someone who has an average memory may be able to remember a greater depth of information about one subject. Most people think of long-term memory when they think of "memory" itself. Biologically, information must first pass through sensory and short-term memory before it can be stored as a long-term memory. An experience, Ej, is an event, or an episode, that we participate in or live through. We all learn from experiences, regardless of whether they are ugly, bad, or happy. Experience is cumulative and is later transformed into knowledge. Experience is gained by repeated trials and leads to heuristic knowledge. We all build experience in life. Quantitatively speaking, experience is a function of time f(t) with a start time Ej(t1) and a finish time Ej(t2). Experience duration is expressed as t2 − t1. Once the experience is stored in the brain, it gets magically assembled with similar neural codes and becomes an episodic record of knowledge.
CEWPS considers a cyberattack as an independent, discrete event that has a start and finish time. However, knowing about past cyberattacks Ej−1 (a priori) can help to forecast incoming attacks. CEWPS is a smart machine built with one purpose in mind: to catch cyberattacks before they occur. That is the magic of CEWPS.
What is knowledge?
Knowledge, on the other hand, is different from experience. It is the derivative of experience, as shown in the equation below:

K(t) = dE(t)/dt

where E = experience, K = knowledge, and t = time. Knowledge can be defined as the fact or condition of knowing something with familiarity gained through experience or association. A knowledge engine takes disparate experience episodes, E(t), converts them into a knowledge pattern, K(t), and catalogs them in the brain for a subsequent neural response. Intelligence is another human characteristic, but it refers to the ability to retrieve knowledge quickly, connect pieces of knowledge together, or process knowledge rapidly. CEWPS stores a priori (from the past) knowledge extracted from previous cyberattacks on the smart grid. Attackers know that there is a smart grid, and they design their attack vector to penetrate the grid from the weakest side and hide until the time comes to spread to the center of the grid.
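The derivative relationship between experience and knowledge can be illustrated numerically with a discrete approximation. The experience values below are purely hypothetical sample data, not taken from the book.

```python
# Hypothetical cumulative experience E(t), sampled once per unit of time.
E = [0.0, 1.0, 3.0, 6.0, 10.0]  # experience grows as episodes accumulate
dt = 1.0                        # time step between samples

# Knowledge K(t) approximated as the discrete derivative dE/dt:
# the rate at which accumulated experience is converted into knowledge.
K = [(E[i + 1] - E[i]) / dt for i in range(len(E) - 1)]
print(K)  # -> [1.0, 2.0, 3.0, 4.0]
```

The rising values of K mirror the text's claim that experience is cumulative: the more episodes accumulate, the faster they are distilled into knowledge.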
The six stages of a cybercrime episode
No one has ever analyzed a cyberattack on a smart grid in a smart city. Let us not forget that the smart grid is not only a power grid; it is a resilient and secure network that connects all the critical infrastructures together. Smart grids are the new paradigm, and technology and security vendors are jockeying to learn about them and introduce their new products. Smart grid cyberattacks will usher in the beginning of the new cyberwar, and we had better be ready for it. CEWPS is designed to defend smart grids, which are the backbones of future smart cities. As we described earlier, a cyberattack is an unknown event that can be represented as a Bayesian network (BN) model describing the conditional dependencies in probability distributions. We can deduce from experience that all cyberattacks have similar patterns and follow the same six stages, as shown in Fig. 3.9. So, an attack on the city smart nanogrid must be well studied, engineered, and executed. We are implementing nanotechnology as an innovative approach to design the smart city grid as well as the SV nanogrid. In the following chapters, we discuss in detail the advantages of the new model of smart grids. The only way to penetrate the nanogrid is to use similar technologies. Cyberterrorists will eventually implement nanotechnology hacking tools to penetrate smart cities. Most of the cyberattacks on large installations or critical infrastructures are carried out with internal help. Nowadays, cyberattacks are driven mostly by political and religious motives. Cyberterrorism is planned and executed by domain experts who know where the most vulnerable spots are in the major cosmopolitan cities of the world. Each country has its own electronic army, which is equipped with the latest symmetrical malware technologies. Cyberterrorists learn from their successes and failures and will try to launch several attacks at
FIGURE 3.9 DDoS attackers must have a large reservoir of knowledge, a bullet-proof plan, and a wide-angle vision. There are six stages to a nano cyberattack, and the attack is only as good as its plan. The success of one attack will trigger successive attacks that learn from previous ones. Almost all attacks follow the same pattern; skipping one stage will make the attack less successful.
different locations on the grid. Most likely, organized electronic armies have knowledge engines similar to CEWPS, where they model attacks and develop new payloads. In other words, smart cities are attacked by smart attackers. The FBI, the Dutch KLPD (Korps Landelijke Politie Diensten), the UK's Interpol bureau, the French Sûreté, and the German BKA (Bundeskriminalamt) all have automated fingerprint identification systems that they share among themselves to catch serial cyberterrorists. However, many cybercriminals were one-time security analysts with great knowledge of the state of the art of present antivirus technologies. CEWPS will be a great help for law enforcement agencies, which are starting to use advanced technologies to track malware enemies. CEWPS collects crime data from many global sources, which is used in predictive reasoning in the inference engine. However, the focus in CEWPS is more on preattack than on postattack.
Cybercrime raw data distillation process
The world is full of crime cases that come in different formats and languages. In order to benefit from these crime cases, they need to be identified, collected, and structured under one format. Then each case is converted into a data model and stored in the attack knowledge base, as shown in Fig. 3.10. The distillation process is made of five consecutive steps, after which CEWPS can perform predictions.
FIGURE 3.10 The distillation process. Unstructured attack cases from US/global intelligence grids and local law enforcement monitoring sources are collected, filtered, and transformed in five steps: from disparate unstructured crime data, through structured crime data and ontological and semantic transformation, into catalogued data models stored in the attack knowledge base. The other components of CEWPS will be discussed later. A newer version of CEWPS will incorporate AI and nanotechnology.
Step 1: The raw data feeders. US/global intelligence agencies and local law enforcement agencies feed in all fresh and historic data. The millions of crime cases will enrich the knowledge base and enhance the performance of the inference engine.
Step 2: Raw input data. As the crime case data come from many different law enforcement agencies, research organizations, or other crime repositories, the data will be disparate, redundant, structured in different formats, or even processed by different software systems. Each country, for example, has its own proprietary fingerprint or face identification system. The data collector takes care of homogenizing the data.
Step 3: Raw data collection. All unedited cybercrime episodes are routed to an intermediate repository for cleaning and filtering. This process is much like an iron mill, where iron is smelted and cast, or an oil refinery.
Step 4: Ontological and semantic transformation. In this step, we combine the attributes of each crime case into an ontology record and then combine all similar crime cases together. We use ontology and semantics technologies, as shown in Fig. 3.11, to standardize all the crime records before passing them to the causality prediction engine (discussed later in this chapter). CEWPS comes with two knowledge bases: the attack knowledge base (Virus Library) and the vaccine knowledge base (Vaccine Library). The two knowledge bases will be discussed under Components 8 and 9.
Step 5: The attack knowledge base. The CEWPS attack knowledge base (AKB) is a very smart engine with high-level autonomicity and intelligence. It is an "expert library" that contains significant archival crime episodes from around the world. The CEWPS AKB uses special cybercrime taxonomy codes. CEWPS uses nanotechnology's miniaturization process to provide the SVG, which is a nanogrid, with a faster response time and accurate information.
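The homogenization at the heart of steps 2 through 5 can be sketched as a small normalization pipeline. Everything here is hypothetical: the field names, the agency labels, and the mapping table are invented for illustration and do not come from any actual CEWPS schema.

```python
# Step 2: disparate raw crime records from different agencies,
# using different field names for the same attributes (hypothetical data).
raw_cases = [
    {"attack_type": "DDoS", "target": "power grid", "src": "FBI"},
    {"typ": "DDoS", "victim": "power grid", "agency": "BKA"},
]

# Steps 3-4: a mapping table that homogenizes field names into one
# uniform ontology record format (field names are hypothetical).
FIELD_MAP = {
    "attack_type": "type", "typ": "type",
    "target": "victim", "victim": "victim",
    "src": "source", "agency": "source",
}

def to_ontology_record(case: dict) -> dict:
    """Transform one disparate record into the standardized format."""
    return {FIELD_MAP[k]: v for k, v in case.items()}

# Step 5: store the standardized records in the attack knowledge base.
knowledge_base = [to_ontology_record(c) for c in raw_cases]
print(knowledge_base[0])  # -> {'type': 'DDoS', 'victim': 'power grid', 'source': 'FBI'}
```

After this pass, both records share one format, so similar crime cases can be grouped together as the ontology step describes.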
FIGURE 3.11 Relationship between ontology and semantics. A crime ontology record is a collection of related attributes tied together. We convert all crime ontology records into a standardized format called the knowledge model.
Cyberterrorism is the biggest global supply chain in the world, with a monetary flow of $50 billion per year. Cybercrime and its older brother, cyberterrorism, are the most affluent professions in the world. Forty-nine law enforcement agencies will be the regular crime data feeders into the CEWPS AKB. CEWPS will have the biggest crime "refinery" in the world. The CEWPS AKB engine will run 24/7 to clean up the data. Similar to crude oil, raw crime episodes go through an extensive distillation process that converts the raw data into a uniform structure. The formatted crime episode is then routed into the attack knowledge base and transformed into a "knowledge model" before it goes into the inference engine for predictive induction.
CEWPS component 3: The reasoning engine
We would like to introduce two important terms used by the reasoning engine.
What is causality?
Causality is the relationship between an event (the cause) and a second event (the effect, which comes after the first event), where the second event is understood to be the consequence of the first. It governs the relationship between events. There are, however, two cases of causality. First, necessary causes: if x is a necessary cause of y, then the presence of y necessarily implies the presence of x. The presence of x, however, does not imply that y will occur. Second, sufficient causes: if x is a sufficient cause of y, then the presence of x necessarily implies the presence of y. However, another cause, z, may alternatively cause y. Thus, the presence of y does not imply the presence of x.
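The distinction between necessary and sufficient causes can be checked mechanically against observed events. The toy observations below are hypothetical, chosen only to make the two definitions concrete.

```python
# Toy observations of a cause (power_on) and an effect (server_up).
# These events and values are hypothetical illustration data.
observations = [
    {"power_on": True,  "server_up": True},
    {"power_on": True,  "server_up": False},   # power alone did not suffice
    {"power_on": False, "server_up": False},
]

# Necessary cause: every time the effect y occurred, the cause x was present.
necessary = all(o["power_on"] for o in observations if o["server_up"])

# Sufficient cause: every time the cause x occurred, the effect y followed.
sufficient = all(o["server_up"] for o in observations if o["power_on"])

print(necessary, sufficient)  # -> True False
```

Here power is a necessary but not sufficient cause of the server being up, exactly the asymmetry the definitions above describe.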
What is prediction?
A prediction of this kind might be inductively valid if the predictor is a knowledgeable person in the field and is employing sound reasoning and accurate data. As a rule, predicting that an event E(tf) will happen in the future (tf) is perfectly valid if, previously, one or several similar events occurred at the same place and time: E(tf−1), E(tf−2), E(tf−3). We consider the probability that event E(tf) will occur provided event E(tf−1) did occur. Using probability formalism, we can write P(E(tf) | E(tf−1)). In summary, we can deduce causal mechanisms from past data. Causality is an ingredient of the CEWPS reasoning engine. The reasoning engine, driven by deep learning, is the "smart guy" component of CEWPS, often called a knowledge-based system (KBS); it is an inference (reasoning) engine that relies on the Bayesian network model (BNM) to generate a probabilistic attack forecast. Some of the benefits of using the BNM are as follows:
• They are graphical models, capable of displaying relationships clearly and intuitively.
• They are directional, thus being capable of representing cause-effect relationships.
• They handle uncertainty through the established theory of probability.
• They can be used to represent both direct and indirect causation.
A BN is a set of local conditional probability distributions. Together with the graph structure, they are enough to represent the joint probability distribution of the domain:

Pr(X1, X2, …, Xn) = ∏(i = 1 to n) Pr(Xi | Pai)

where Pai is the set containing the parents of Xi in the Bayesian network.
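The joint-probability factorization described above can be worked through on the smallest possible network: two binary variables, Attack → Alert. The probability numbers are hypothetical, chosen only to show the chain rule and a Bayesian update in action.

```python
# A tiny Bayesian network: Attack -> Alert (hypothetical probabilities).
p_attack = {True: 0.1, False: 0.9}                      # Pr(Attack)
p_alert_given_attack = {                                # Pr(Alert | Attack)
    True:  {True: 0.8,  False: 0.2},
    False: {True: 0.05, False: 0.95},
}

def joint(attack: bool, alert: bool) -> float:
    # Chain rule over the graph: Pr(Attack, Alert) = Pr(Attack) * Pr(Alert | Attack)
    return p_attack[attack] * p_alert_given_attack[attack][alert]

# Probabilistic forecast: given an alert was observed, how likely is an attack?
# Bayes' rule: Pr(Attack | Alert) = Pr(Attack, Alert) / Pr(Alert)
p_alert = joint(True, True) + joint(False, True)
posterior = joint(True, True) / p_alert
print(round(posterior, 3))  # -> 0.64
```

Even though the prior probability of an attack was only 0.1, observing the alert raises it to 0.64, which is exactly the kind of update the reasoning engine's attack forecast relies on.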
We can forecast weather; why can we not predict crime?
Let us take a look at other forecast systems, such as weather and stock forecasting approaches. Weather forecasting is the application of science and technology to predict the state of the atmosphere for a given location. Weather forecasts are made by first collecting quantitative weather data from weather satellites about the current state of the atmosphere at a given place and using scientific understanding of atmospheric processes to project how the atmosphere will change. The second step is to enter this data into a mathematical weather model to generate credible prediction results.
Anatomy of the causality reasoning engine
The CRE is an AI-based system, commonly known as the reasoning inference engine (RIE). This type of engine is the Holy Grail of AI science: a highly educated and fast-learning machine. Fig. 3.12 shows the three stages that describe how machine learning works and the prerequisite components needed to achieve a reliable attack prediction. The digital immunity ecosystem (CEWPS) relies heavily on this precious component, which requires machine intelligence merged with human knowledge and experience. It is one of the most advanced components of CEWPS. We will discuss it further to help the reader understand the complexities of AI and its application in machine learning (ML). An inference engine cycle includes three sequential steps: match rules, select rules, and execute rules. The execution of the rules will often result in new facts or goals being added to the knowledge base, which will trigger the cycle to repeat. This cycle continues until no new rules can be matched.
56
Chapter 3 The miraculous anatomy
[Fig. 3.12 depicts three stages: (1) the distillation process, which converts raw, unedited historical attacks into the attack models knowledge base; (2) the inference process, in which the deep reasoning engine combines Bayesian network attack models, facts and rules from the knowledge base, crime domain experts, and the virus and vaccine knowledge databases; and (3) the prediction process, in which the early warning predictor takes incoming attack clues, the city coordination center issues alerts to critical systems, and the Smart Vaccine is dispatched against the attack payload.]
FIGURE 3.12 The causality reasoning engine is the GPS of CEWPS. Unsupervised and supervised machine learning are incorporated into the system. It is the early warning predictor that extracts information about previous attacks, the strategies of attack and defeat, and what kind of vaccine was used. The reasoning engine also gets information from the facts and rules.
The first step (match the rules): The inference engine finds all of the rules that are triggered by the current contents of the knowledge base. In forward chaining, the engine looks for rules where the antecedent (left-hand side) matches some facts in the knowledge base. In backward chaining, the engine looks for antecedents that can satisfy one of the current goals.
The second step (select the rules): The inference engine prioritizes the matched rules to determine the order in which to execute them.
The third step (execute the rules): The engine executes each matched rule in the order determined in step two and then iterates back to step one. The cycle continues until no new rules are matched.
The AKB represents facts about the cyberattack, the cybercriminal profile, the victims, and the impact of the attack. The fact world is represented using classes and subclasses; instances and assertions are replaced by values of object instances. The rules work by querying and retrieving the right attack record. In addition to the input source of the AKB, two significant knowledge bases participate in the reasoning process: the virus knowledge base (VIKB) and the vaccine knowledge base (VAKB). They will be discussed later in this chapter.
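The three-step cycle can be sketched as a tiny forward-chaining loop. The facts and rules below are invented placeholders, not the actual CEWPS attack knowledge base:

```python
# Sketch of the match-select-execute cycle of a forward-chaining
# inference engine. Rules and facts are illustrative placeholders.

facts = {"port_scan_detected", "unusual_login_time"}

# Each rule: (set of antecedents, consequent)
rules = [
    ({"port_scan_detected"}, "reconnaissance_suspected"),
    ({"reconnaissance_suspected", "unusual_login_time"}, "raise_alert"),
]

while True:
    # Step 1 - match: rules whose antecedents are all known facts
    # and whose consequent is not yet asserted.
    matched = [r for r in rules if r[0] <= facts and r[1] not in facts]
    if not matched:
        break  # no new rules can be matched: the cycle terminates
    # Step 2 - select: here, simply take matched rules in listed order.
    antecedents, consequent = matched[0]
    # Step 3 - execute: assert the consequent as a new fact and repeat.
    facts.add(consequent)

print(sorted(facts))
```

Note that "raise_alert" only becomes derivable on the second pass, after the first pass has added "reconnaissance_suspected" to the knowledge base, which is exactly the repeat-until-quiescence behavior described above.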
CEWPS component 4: reverse engineering center
The reverse engineering center (REC) is responsible for decomposing all unknown payloads from all attacks and learning everything about their code and technologies. The pathology reports on captured and quarantined viruses will act as catalogs and are stored in the VIKB. Information coming from other forensics centers will also be stored in the VIKB. Meanwhile, the corresponding antivirus vaccine will be stored in the VAKB. The central coordination center will receive daily bulletins from the REC. The smart vaccine center (SVC) is responsible for generating the SV nanobots and sending them to the city nanogrid in response to cyberattacks. The tools of reverse engineering are categorized into disassemblers, debuggers, hex editors, and monitoring and decompiling tools:
Disassemblers: A disassembler is used to convert binary code into assembly code and to extract strings, imported and exported functions, libraries, and so forth. Disassemblers convert the machine language into a user-friendly format, and different disassemblers specialize in certain things.
Debuggers: This tool expands the functionality of a disassembler by exposing the CPU registers, a hex dump of the program, a view of the stack, and the like. Using debuggers, programmers can set breakpoints and edit the assembly code at run time. Debuggers analyze binary code much as disassemblers do, and they allow the reverser to step through the code one line at a time to investigate the results.
Hex editors: These editors allow the binary code to be viewed and changed as required. Different types of hex editors are available for different functions.
Portable executable and resource viewer: The binary code is designed to run on a Windows-based machine and has a very specific dataset that tells the system how to set up and initialize a program.
All programs that run on Windows have a portable executable structure that records the dynamic link libraries from which the program needs to borrow code.
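One of the simplest reverse-engineering primitives mentioned above, extracting printable strings from a binary payload (as the Unix strings utility and most disassemblers do), can be sketched in a few lines. The sample payload bytes are invented for illustration:

```python
import re

# Minimal sketch of string extraction from a binary blob: find runs of
# printable ASCII characters of a minimum length. The sample payload
# below is an invented example, not a real malware binary.

def extract_strings(blob: bytes, min_len: int = 4):
    """Return runs of printable ASCII characters of at least min_len."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, blob)]

sample = b"\x00\x01MZ\x90\x00http://evil.example/payload\x00\xff\x02cmd.exe\x00"
print(extract_strings(sample))  # ['http://evil.example/payload', 'cmd.exe']
```

Embedded URLs, file names, and commands recovered this way are typical first clues an analyst catalogs before moving on to disassembly proper.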
CEWPS component 5: smart city critical infrastructure
The smart city critical infrastructure is a vital component of the CEWPS, which shields all critical systems of the city and keeps them in normal operating mode. The SVG is connected in real time to these systems, and they are monitored at all times. Fig. 3.13 gives an aerial view showing a variety of critical systems connected to two grids: the city grid and the smart vaccine grid. In the domain of terrorism, the term "critical infrastructure" has become of paramount importance. In fact, certain national infrastructures are so vital that their incapacity or destruction would have a debilitating impact on the defense or economic security of the country.
The critical infrastructures in smart cities
What is criticality? It is appropriate to clarify the subject of criticality because it is very closely related to threats, risks, and attacks. Fig. 3.14 shows the critical systems in the smart city, grouped by type of criticality. Criticality is a relative measure of the consequences of a failure mode and its frequency of occurrence.

FIGURE 3.13 The 11 critical infrastructures with the respective criticality scale.

To say that the power grid is highly critical means that a blackout will create a very grave impact. The power grid is very complex and has many interconnections and components. The cause of a failure can be human, mechanical, electrical, or in the design of the system. The resultant failure, therefore, can be catastrophic, critical, or marginal. The failure mode is defined as how a failure is observed: it describes the way the failure occurs and its impact on equipment operation. A failure mode deals with the present, whereas a failure cause happened in the past and a failure effect deals with the future. Let us analyze the situation numerically. The criticality of a failure mode is

Cm = β α λp t

Criticality(mode) = (probability of the next higher failure effect, β) × (failure mode ratio, α) × (part failure rate, λp) × (duration of the applicable mission phase, t)
FIGURE 3.14 The 11 critical systems that manage the infrastructure in a smart city ranked by criticality. There are six criticality levels. Power and energy were determined to be the most catastrophic in the case of a city or regional blackout.
Total item criticality (Cr) is the joint probability of the item under all its failure modes:

Cr = Σ (β α λp t)n, for n = 1, 2, 3, ..., j

where
Cr = criticality probability
n = the failure mode of the item being analyzed, n = 1, 2, 3, ..., j
j = the number of failure modes for the item being analyzed
Severity is an attribute associated with the damage caused by the cyberattack. Fig. 3.15 shows an interesting relationship between failure probability as a function of criticality.
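The two formulas can be evaluated directly. In this sketch, the β, α, λp, and t values are invented for illustration and do not come from real failure data:

```python
# Sketch of the criticality computation above. Parameter values
# are illustrative, not real failure data.

def criticality_mode(beta, alpha, lam_p, t):
    """Cm = beta * alpha * lambda_p * t for one failure mode."""
    return beta * alpha * lam_p * t

def item_criticality(modes):
    """Cr = sum of Cm over all j failure modes of the item."""
    return sum(criticality_mode(*m) for m in modes)

# Two hypothetical failure modes of a power-grid component:
# (beta: prob. of next higher failure effect, alpha: failure mode ratio,
#  lambda_p: part failure rate per hour, t: mission duration in hours)
modes = [
    (0.9, 0.6, 1e-5, 1000),   # Cm = 0.0054
    (0.5, 0.4, 1e-5, 1000),   # Cm = 0.0020
]
print(round(item_criticality(modes), 6))  # 0.0074
```

The larger Cr is, the higher the item ranks on the criticality scale of Fig. 3.14 and the stronger the case for prioritizing its protection.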
FIGURE 3.15 Most of the attacks on infrastructures are considered "critical" and associated with high failure probability. Smart cities cannot afford to have infrastructures with a high level of vulnerability. Predictive analytics also comes in handy when considering predictable future scenarios.

What is a critical infrastructure?
We keep using the term "critical infrastructure." However, what does it mean in terms of a smart city? In a smart city, energy, water, transportation, public health and safety, and other key services are managed in concert to support the smooth operation of the critical infrastructure while providing for a clean, economic, and safe environment in which to live, work, and play. Timely logistics information will be gathered and supplied to the public by the cloud, secure information highways, or social media networks. The energy infrastructure is arguably the single most important feature of any city: if it is unavailable for a long enough period, all other functions will eventually cease. This is why CEWPS utilizes the smart grid to offer on-demand vaccination services to immunize the power grid as well as the other critical infrastructure systems. "Critical infrastructure" is a term coined by governments to represent the backbone of the smart city's economy, security, and health. People know it as the power they use at home, the water they drink, the transportation that moves them, and the communication systems they rely on to stay in touch with friends and family. The corollary can be stated as follows: a smart city cannot exist without a smart grid protecting its critical infrastructures. How did a particular infrastructure become critical? This is an interesting question. Infrastructures were born with cities. The oldest infrastructures were aqueducts and roads. The Egyptians built canals and irrigation systems but not many roads; roads were less important to them because they relied on the Nile for transportation. The Romans built aqueducts, roads, and bridges. As cities became more and more populated, the infrastructures became more important, and new sections were added to the old ones. Amazingly, Tokyo is rated as the third most impressive "smart city" on Earth (see http://freshome.com/2013/02/07/10-most-impressive-smart-cities-on-earth). This is the secret: after months of rolling blackouts due to a lack of nuclear power, the need for the Japanese to innovate has never been greater. Japan's biggest companies are behind the smart city revolution taking place around the globe and are using Tokyo as their proving ground.
Panasonic, Sharp, Mitsubishi, and many other big names are working very hard to infuse smart technology into this massive city. Engineering, medicine, and the military became the most important elements of civilizations' survival. Engineering's job is to build a healthy city; medicine's job is to maintain healthy citizens; and the military's purpose is to defend the city. George Stephenson invented the steam locomotive engine in 1820. Karl Benz invented the modern car in 1879. The Wright Brothers invented the airplane in 1903. These three inventions brought three new infrastructures to the modern world. Thomas Edison gave us the grace of electricity in 1879; Alexander Graham Bell gave us the telephone in 1876; Thomas J. Watson gave us the IBM computer in 1953. Then, Dr. Leonard Kleinrock, professor of computer science at UCLA, and Vinton Cerf gave us the Internet. This was the beginning of the electronic "big bang" that we are living in today. The reality of living defies predictions and forecasts. Today, 54% of the world's population lives in urban areas, a proportion that is expected to increase to 66% by 2050. Just 10 years ago, the number of Internet users was 910 million, with a world population of 6.4 billion. These numbers have jumped to a whopping 3.1 billion users with a world population of 7.2 billion; Table 3.1 shows the March 2019 figures. Cities are becoming more crowded and noisier, with more crime and poverty. However, governments are fighting all these miseries by seriously considering jumping on the smart city bandwagon. The strategy is to utilize innovation and the ability to solve social problems, and to use information and communication technologies (ICTs) to improve this capacity. The intelligence that aids in the ability to solve the problems of these communities is linked to technology transfer, which takes place when a problem is solved.
In this sense, intelligence is an inner quality of any territory, place, city, or region where innovation processes are facilitated by information and communication technologies. What varies is the degree of intelligence, depending on the people, the system of cooperation, and the digital infrastructure and tools that a community offers its residents. Take, for example, Tokyo: it tops the population list and remains the world's largest city, with 38 million dwellers.

Table 3.1 World Internet usage and population statistics (March 2019 update).

World regions              Population       Population    Internet users,    Penetration      Growth       Internet
                           (2019 est.)      % of world    Dec 31, 2018       rate (% pop.)    2000-2019    users (%)
Africa                     1,320,038,716    17.0          464,923,169        35.2             10,199%      10.8
Asia                       4,241,972,790    54.7          2,160,607,318      50.9             1,790%       50.1
Europe                     866,433,007      11.2          705,064,923        81.4             571%         16.3
Latin America/Caribbean    658,345,826      8.5           438,248,446        66.6             2,325%       10.2
Middle East                258,356,867      3.3           170,039,990        65.8             5,076%       3.9
North America              366,496,802      4.7           345,660,847        94.3             219%         8.0
World total                7,753,483,209    100.0         4,312,982,270      55.6             1,095%       100.0

Courtesy of https://www.internetworldstats.com/.
CEWPS component 6: the Smart Vaccine center
The SVC is the "Marines" of CEWPS. It receives marching orders from the CCC to perform its vaccination services for all the critical systems on the smart grid. One of the great contributions to humanity was the discovery of the vaccine; without adaptive immunity, one-fourth of the human race would have been terminally ill. With the exponential acceleration of technology, the digital world is like a runaway train with no control, and the whole world needs quality of life, not only smart cities. Digital immunity will offer a similar contribution to the digital world. The SV is one of the most fascinating services that technology could offer to win the battle against cybercrime and terrorism. The SV is built with two avant-garde, cutting-edge technologies: AI and nanotechnology. The SV was conceived with futuristic features that will be ready in a decade. The SV has seven well-programmed components, as shown in Fig. 3.16, that work autonomically in any difficult battle situation against adversary nanobots, much like the B cells of the human body. Nanotechnology offers incredible advantages in the design of the nano SV, which is the new generation of smart weaponry against cyber malware.
FIGURE 3.16 The ingenious structure of the nano SV can easily roam inside the blood and nerve vessels, and surprisingly inside the smart city nanogrid as well.
CEWPS component 7: the vaccine knowledge base
The VAKB is the intelligent "pharmacy" that holds the prescriptions of all vaccines that were manufactured for previous attacks. It works very closely with the causal reasoning engine. Further explanation will be provided in the next section.
CEWPS component 8: the virus knowledge base The VIKB is the repository that contains all the attack payloads, descriptions, source, and expected target. It works very closely with the causal reasoning engine. Further explanation will be provided in the next section. The vaccine and virus knowledge bases are critically important to the overall security of the smart grid, as shown in Fig. 3.17. There will be situations where the virus is not available in the knowledge base and its matching vaccine is not ready; in this case, the infected system will provide samples of the virus and a vaccine will promptly be fabricated for the rest of the systems on the grid.
FIGURE 3.17 The parallelism between virus reverse engineering (once it is caught for forensics) and vaccine processing is uncanny and fascinating. Using Bayesian visualization reasoning, CEWPS comes up with amazing predictions to vaccinate the critical systems before the attack spreads to them.
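The lookup-and-fallback workflow just described (reuse a stored vaccine when the virus is known; fabricate one from a captured sample when it is not) can be sketched as follows. The signatures, vaccine names, and the fabricate_vaccine helper are all hypothetical placeholders:

```python
# Sketch of the vaccine-lookup workflow: match a virus signature against
# the vaccine knowledge base (VAKB); if no vaccine exists, fabricate one
# from a captured sample and store it for the rest of the grid.
# All signatures and vaccine identifiers are invented examples.

vakb = {"worm.alpha": "vaccine-a17", "trojan.beta": "vaccine-b03"}

def fabricate_vaccine(sample: str) -> str:
    """Placeholder for the reverse-engineering/fabrication pipeline."""
    return f"vaccine-new-{hash(sample) % 1000:03d}"

def get_vaccine(signature: str, sample: str) -> str:
    if signature in vakb:
        return vakb[signature]           # known virus: reuse stored vaccine
    vaccine = fabricate_vaccine(sample)  # unknown virus: fabricate from sample
    vakb[signature] = vaccine            # store for the other grid systems
    return vaccine

print(get_vaccine("worm.alpha", ""))          # known: returns vaccine-a17
print(get_vaccine("worm.gamma", "<sample>"))  # unknown: fabricates a new entry
```

After the second call, "worm.gamma" is cataloged in the VAKB, so subsequent systems on the grid receive the already fabricated vaccine.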
64
Chapter 3 The miraculous anatomy
CEWPS component 9: CEWPS smart Nanogrid
We are now leaving the fourth generation of nanotechnology and entering the fifth. We are going to achieve marvelous things, including nanosystems that will outperform our present computing hardware and software. Nano could replace the current technology, which sends data through metal lines, with metallic carbon nanotubes, which conduct electricity better than metal. When information is sent from one core to another, the outgoing electrical signal would be converted to light and travel through a waveguide to another core, where a detector would change the data back to electrical signals. Nanowires thus represent the best-defined class of nanoscale building blocks, and this precise control over key variables has correspondingly enabled a wide range of devices and integration strategies to be pursued. CEWPS is giving us a new generation of cybersecurity, exemplified in a robust and intelligent environment known as DI. One of the most advanced components of CEWPS is its smart NanoGrid. Fig. 3.18 shows how the city NanoGrid connects all the critical systems in the city of Dubai to the Central Coordination Center (CCC). Without the grid, we would not have digital immunity, and without digital immunity, we would not have a securely intelligent, early-warning, predictive smart city. The CEWPS screen in Fig. 3.18 shows how the city of Dubai's counties and subcounties have been mapped into a grid-centric screen with all the critical systems that control the major infrastructures in the city. The grid has two-dimensional coordinates to facilitate the location of any system. Each critical system is recognized by a location code and type of infrastructure. CEWPS used infrastructure data from the city of Dubai during its development. Both the city and the SV CCCs are connected in real time to the smart NanoGrid through intelligent autonomic nanoadapters, as shown later in Fig. 3.19.
FIGURE 3.18 CEWPS has a great feature that no other system has: a graphical real-time representation of the smart city nanogrid during an attack, showing how the SV nanobots (smart immunity vaccinators) are neutralizing the attackers.
FIGURE 3.19 All city critical infrastructure systems and IoT devices will be connected to the autonomic adapter (AA). Artificial intelligence and nanotechnology will make this a reality to spread digital immunity in all smart cities.
The smart grid model
During the design of CEWPS, we looked at the risk issue and how it can best be managed. We realized that the only way to stay ahead of the malware curve is to move away from conventional technologies and jump into the two leading domains of AI and nanotechnology. CEWPS and all its smart components can harness these two technologies to fabricate and assemble the digital immunity of the future. We wanted to create a replica of the city with a smart grid over it, with a huge variety of connections; different types of devices, sensors, and meters; and selected critical infrastructures. The city would then be connected to a computerized city grid screen, and all the coordinates of the devices in the city would be reflected on it in real time. If an attack hit the city at a specific square of the nanogrid, the computer screen would indicate the exact location and type of the attack and what to do to eradicate it. The central coordination center would then dispatch an alert to all the subscribed citizens on the grid, and at the same time the SV nanobots would rush to the site of the attack.
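The grid-to-screen mapping described above can be sketched as a simple coordinate lookup: each critical system sits at two-dimensional grid coordinates with a location code and an infrastructure type. The codes and coordinates below are invented for illustration:

```python
# Sketch of the grid-centric mapping: each critical system is keyed by
# its two-dimensional grid coordinates and carries a location code and
# infrastructure type. All codes and coordinates are invented examples.

grid = {
    (4, 7): {"code": "PWR-0471", "type": "power"},
    (2, 3): {"code": "WTR-0230", "type": "water"},
    (9, 1): {"code": "HOS-0915", "type": "healthcare"},
}

def locate_attack(x: int, y: int) -> str:
    """Resolve an attacked grid square to the critical system it hosts."""
    system = grid.get((x, y))
    if system is None:
        return "no critical system at this square"
    return f"dispatch Smart Vaccine nanobots to {system['code']} ({system['type']})"

print(locate_attack(4, 7))  # attack reported at grid square (4, 7)
```

This is the kind of resolution step that would let the coordination center turn a reported grid square into an alert and a dispatch order.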
Connectivity of critical systems to a city's smart Nanogrid
Soon, smart cities will be equipped with two layers of special grids. The first is the city nanogrid, and the second is the digital immunity grid. Both grids will be made of special nanomaterials, such as nanotubes, nanowires, or nanomesh. Nanotechnology will give us a smart nanogrid that will be used for digital immunity. This futuristic wire, with a 300-nm diameter, will connect all the devices, meters, and sensors to the city nanogrid. A special autonomic adapter will establish real-time connectivity, as shown previously in Fig. 3.19. The city's critical infrastructure systems (CIS) will also be equipped with adapters to interface with both the city smart nanogrid and the SV nanogrid. The smart nanogrids will be the main "highways" to transport vaccination services, dispatch alerts, report attack outcome status, exchange system-to-system (S2S) messages, and broadcast important administrative instructions.
Anatomy of the autonomic adapter
The autonomic adapter (AA) engine is the smart interface between the CIS and the smart grid. Its responsibilities include the following:
• Delivery of the vaccination services to the critical infrastructure system, also called the client system
• Sending attack outcome status to the central coordination center
• Authentication of grid users
• Vaccination schedules and provisions
• Security bulletins and alerts
• Attack data from infected systems
• Connectivity between systems
The autonomic adapter (AA) engine includes seven functional components, as shown in Fig. 3.19:
The effectors: As the name indicates, they describe what is being done in response to an attack. Effectors receive response services from the CCC and pass them to the planner for implementation.
The sensors: They are like radar; they check all incoming signals (adversary and friendly) before they connect with the knowledge gear.
The analyzer: This component provides the mechanism to evaluate the situation (normal versus attack) based on performance and security metrics and rules.
The monitor: As the name implies, it monitors the sensors that provide real-time signals from the CIS according to the rules of the analyzer.
The planner: This element packages a list of vaccination services coming from the CCC into a workflow and passes it to the executioner.
The executioner: This mechanism receives the vaccination services from the planner and passes them to the CIS.
The knowledge gear: This area gathers activity sensor and effector data from all of the functional components of the AA, which is then converted into knowledge and can be retrieved by any of the other components for decision support.
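The components above resemble the classic autonomic monitor-analyze-plan-execute loop operating over shared knowledge. A minimal sketch of one pass through such a loop follows; the signal values, thresholds, and workflow step names are invented for illustration:

```python
# Sketch of the autonomic adapter's control loop: the monitor reads
# sensor signals, the analyzer applies rules, the planner builds a
# workflow, and the executioner applies it, with outcomes recorded in
# the knowledge gear. All values and thresholds are illustrative.

knowledge = {"history": []}  # the knowledge gear: shared state

def monitor(sensor_signal: float) -> dict:
    """Turn a raw sensor reading into symptoms (threshold is invented)."""
    return {"traffic_spike": sensor_signal > 0.8}

def analyze(symptoms: dict) -> str:
    """Evaluate the situation: normal versus attack."""
    return "attack" if symptoms["traffic_spike"] else "normal"

def plan(state: str) -> list:
    """Package the response services into a workflow."""
    return ["dispatch_vaccine", "notify_ccc"] if state == "attack" else []

def execute(workflow: list) -> list:
    """Apply the workflow and record the outcome in the knowledge gear."""
    knowledge["history"].append(workflow)
    return [f"executed:{step}" for step in workflow]

# One pass of the loop for a suspicious sensor reading:
print(execute(plan(analyze(monitor(0.93)))))
```

A normal reading (for example, 0.2) yields an empty workflow, so the loop idles until the sensors report something worth acting on.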
A smart city is an idealistic hype
Let us start with a realistic note: a smart city is an idealistic hype. Although the term is very attractive, it will not happen anytime soon. It is a moving target with many hard-to-control variables. Not only do smart cities need inexhaustible approved budgets, but above all, they need a stable government led by credentialed, visionary leaders. Setting up the "base city" of a smart city takes at least 20 years. No one has started a smart city from scratch, which would be a futile endeavor, if not impossible. According to the United Nations, there are 196 sovereign capital cities on Earth; none of them is a true smart city. So, we may consider a smart city nothing but a cross-pollinated booming urbanite with some inherent beauty. It can be on the ocean, in the ocean, or in the middle of the desert. The term "smart city" (SC) is a bit ambiguous. Some people choose a narrow definition: that is, a city that uses information and communication technologies to deliver services to its citizens. My favorite definition of a smart city is the following: a booming metropolis with resilient and future-ready infrastructures, bundled with intelligent ICT to secure the use of its key resources, and managed by a team of smart citizens and a smart government. However, the most influential variable in the SC polynomial is its ability to secure itself from physical, human-provoked attacks and catastrophes. A smart city should have a mechanism to defend itself and to mitigate and eliminate danger. Protecting the critical infrastructures and key resources (CIKR) in the smart city requires a brand-new approach to cybersecurity and a new set of technologies. CEWPS was specifically designed to meet the crucial security requirements of smart cities. CEWPS is two generations ahead of the present AVTs. What are the CIKRs that smart cities must worry about? Fig. 3.20 visually describes the behavior of the system hardware, which will help smart city administrators size the proper configuration. In addition, it gives another perspective on planning system capacity, an issue which is often neglected at the planning stage.
FIGURE 3.20 This graph shows the behavior of the smart city system from the capacity of the hardware. Sizing the proper hardware is critical to serve the present workload of the users. Initially, the system was loaded with the IoT users and their sensors, and then we added the residential power consumption, which required a second upgrade. If the computing infrastructure (CPU, storage, and network) were sluggish, residents would be discouraged from using important devices at home and at work. Upgrades take time and are expensive. The modernization of smart cities is driven by well-planned computing infrastructures (cloud, big data, Internet providers, security providers, and, above all, reliable connectivity).
The National Strategy for Homeland Security has identified 16 critical sectors, described at https://www.dhs.gov/topic/critical-infrastructure-security. As we learn more about threats, means of attack, and the various criteria that make targets lucrative for terrorists, this list will evolve. The critical infrastructure sectors consist of the "Chemical Sector, Commercial Facilities Sector, Communications Sector, Critical Manufacturing Sector, Dams Sector, Defense Industrial Base Sector, Emergency Services Sector, Energy Sector, Financial Services Sector, Food and Agriculture Sector, Government Facilities Sector, Healthcare and Public Health Sector, Information Technology Sector, Nuclear Reactors, Materials and Waste Sector, Transportation Systems Sector, Water and Wastewater Systems Sector." Smart cities should be characterized by optimum urban performance reflected in these 16 critical sectors. However, smart cities are more than the sum of those sectors. We can say that an SC is a digitally intelligent city; in other words, it is a balanced hybrid mixture of networked infrastructures and human capital. Only CEWPS (call it the Holy Grail) would qualify to protect the 16 critical sectors of the city.
Sample system performance screens on the city dashboard
The CEWPS/SV system will have around 150 dynamic screens and 100 online reports categorized by component. Figs. 3.21 and 3.22 show the most crucial gauges of the CEWPS system. The first shows the system response as a function of the IoT volume; the second shows how much workload the system can handle. Both graphs are critical.
FIGURE 3.21 This graph shows the criticality of ICT for new smart cities. The critical success factor here is the system capacity to handle the present and future workload. 2014 Copyright (MERIT CyberSecurity Group); All rights are reserved.
FIGURE 3.22 This graph shows the workload utilization of smart city hardware: system utilization (%) plotted against the number of concurrent Internet of Things connections (100,000 to 1,000,000), divided into acceptable, alarm, and critical zones as new communities, new applications, wireless security videos, commercial alarms, and medical sensors are added to the workload. 2014 Copyright (MERIT CyberSecurity Group); All rights are reserved.
Appendices Appendix 3.A Glossary (extracted from MERIT cybersecurity library) Accommodation: A process to which Piaget referred in his theory of cognitive development, whereby an individual’s existing understanding is modified from a new experience. Adaptable hypermedia systems: Systems in which users can explicitly set preferences or establish profiles through filling out forms to provide information for the user model, which is then used to determine the presentation of information. Adaptable systems: Systems in which users can diagnose their progress and modify student/user models as needed. Adaptation model: The component of an adaptive hypermedia system that allows the system to modify information presentation so that reading and navigation style conform to user preferences and knowledge level and that specifies how the user’s knowledge modifies the presentation of information. Adaptive multimedia presentation: A type of content adaptation in which the selection of the presentation medium is based on the needs of the user but which does not yet allow for adaptation of individual elements of multimedia content. Adaptive navigation: Adaptive hypermedia techniques that modify the links accessible to a user at a particular time; link adaptation. Adaptive presentation: Techniques that modify the contents of a page based on the user model; content adaptation. Adaptive systems: Systems that modify the student/user model to adjust to the progress and characteristics of users.
Adaptive text presentation: A type of content adaptation in which the user model determines a page's textual content. Although there are various techniques for adaptive text presentation, they look similar from the perspective of "what can be adapted"; that is, those with varying user models see different textual content as the content for the same page.
Adaptive tutoring systems: These are what many refer to as intelligent tutoring systems, though Fred Streitz is reluctant to ascribe intelligence to technical systems.
Aesthetic entry point: A way to introduce a topic that engages the senses through works of art that relate to the subject matter being studied. In addition, concepts and examples have their aesthetic properties, which can be examined and discussed in conjunction with the topic at hand.
Assimilation: A process of adaptation to interactions with the environment through which individuals add new experiences to their base of knowledge, according to Piaget's theory of cognitive development.
Asynchronous learning: Directed study, or "self-study," that does not occur in real time or in a live instructor-led setting.
Asynchronous Web technology: A type of computer-mediated communication that involves the use of the World Wide Web to provide information in nonreal time.
Behaviorism: A major school of thought on the nature of learning and the properties of knowledge that was dominant in the 1950s and 1960s and focused on the observation of behavior and the adaptation of organisms to the environment. Behaviorist learning theories view knowledge as objective, given, and absolute.
Bodily kinesthetic intelligence: A biopsychological potential that involves using one's body for processing information to solve problems or build products.
Bug catalogue: A set of errors compiled and analyzed by an intelligent tutoring system to indicate where a particular learner is having difficulty.
Categorization: The basis of a cognitive learning theory developed by Jerome Bruner, a cognitive psychologist and educator. According to Bruner's theory, people interpret the world in terms of similarities and differences among various events and objects. While engaged in categorizing, people employ a coding system based on a hierarchical arrangement of categories that are related to each other, with successively more specific levels.

Classicism: An approach to modeling thinking from the field of cognitive science that employs symbolic processing to model thought processes; also called symbolicism.

Cognitive constructivism: A school of thought within constructivism that postulates that learning occurs as a result of exploration and discovery by each learner. In the view of cognitive constructivists, knowledge is a symbolic, mental representation in the mind of each individual.

Cognitive psychobiology: An interdisciplinary field of study involving biological neural studies. D.O. Hebb is considered by many to be the father of cognitive psychobiology.

Cognitive science: The interdisciplinary study of mind and intelligence, which attempts to further an understanding of intelligent activities and the nature of thought. The major contributing disciplines to the field include philosophy, psychology, computer science, linguistics, neuroscience, and anthropology.

Cognitivism: A major school of thought that employs an information-processing approach to learning and uses a model based on the input/output information-processing architecture of digital computers. Although cognitivist learning theories are based on active mental processing on the part of learners, such theories still maintain the behaviorist perspective on knowledge, considering knowledge to be both given and absolute.
Common gateway interface: A standard for external gateway programs to interface with information servers, such as the HTTP (Hypertext Transfer Protocol) servers used for the World Wide Web.

Computational system: A term used in cognitive science to denote a system that uses discrete mathematics to model cognitive agents and the process of cognition.

Computer-assisted (-aided) instruction (CAI): Usually refers to sequentially ordered "linear programs." CAI generally follows a step-by-step procedural approach to the presentation of subject matter, based on the principles of behaviorist psychology.

Computer-based instruction (CBI): Using computers for training and instruction. The term "CBI" usually refers to instruction that does not use technology from AI. Production rules and expert systems are generally not used for sequencing the elements of information presented to the student. This approach generally produces linear sequences of information, and such CBI programs are called "linear programs."

Computer-based learning environments: Systems that use a constructivist approach, based on Piaget's theory of active learning, to provide an environment in which students can develop their own authentic knowledge. Examples of computer-based learning environments are Papert's Mindstorms and Lawler's Microworlds.

Computer-based training (CBT): Developed in the 1950s, CBT bases its training approach on behaviorist psychology. CBT "teaches" courses by presenting the knowledge to be learned through a step-by-step procedure, leading students from one item to be learned to the next.

Computer-mediated communication: The passing of messages or sharing of information through networking tools such as email, conferencing, newsgroups, and websites.

Connectionism: An approach to modeling thinking developed in the field of cognitive science that views thought processes as connections between nodes in a distributed network.
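Connectionism's core idea, learning as the adjustment of connection weights between simple nodes, can be illustrated with a minimal single-node learner in the spirit of Rosenblatt's Perceptron (defined later in this glossary). The following sketch is purely illustrative; the training data and parameter values are invented, and it trains one node to compute the logical AND of two inputs:

```python
# Hypothetical illustration of a connectionist node trained with the
# perceptron learning rule. All names and data are invented for this example.

def train_perceptron(samples, epochs=20, lr=0.1):
    """Learn two input weights plus a bias from (inputs, target) pairs."""
    w = [0.0, 0.0, 0.0]  # weight for x1, weight for x2, bias
    for _ in range(epochs):
        for (x1, x2), target in samples:
            activation = w[0] * x1 + w[1] * x2 + w[2]
            output = 1 if activation > 0 else 0
            error = target - output
            # Strengthen or weaken each connection in proportion to the error
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            w[2] += lr * error
    return w

AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND_SAMPLES)

def predict(x1, x2):
    return 1 if weights[0] * x1 + weights[1] * x2 + weights[2] > 0 else 0
```

Because AND is linearly separable, the weights settle after a few epochs; the "knowledge" the node acquires lives entirely in the connection strengths, which is the connectionist claim in miniature.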
Constructionism: A term coined by MIT researcher Seymour Papert to connote the combination of constructivist learning theories with the creation and development of individually designed learning projects.

Constructivism: A major school of thought on the nature of knowledge that views knowledge as a constructed entity developed by each individual. According to constructivist theory, information can be transmitted but knowledge cannot be transmitted from teacher to student, parent to child, or any one individual to another; rather, knowledge is (re)constructed by each individual in his/her mind and is relative, varying through time and space.

Constructivist learning theory: The theory, originally based on the research of Jean Piaget, that holds that learning is the result of an individual's mental construction. The theory posits that individuals learn by actively constructing their understanding, incorporating new information into the base of knowledge they have already constructed in their minds.

Content adaptation: Techniques that adapt the content of a page based on the user model; also known as adaptive presentation.

Course management system: A type of online learning system, categorized in terms of its functions of content delivery, assessment, and administration.

Deconstruction: An analytical method for uncovering multiple interpretations of a text, developed by the French philosopher Jacques Derrida in the 1960s.

Differential equations: A branch of mathematics used in dynamic systems theory to describe a multidimensional space of potential thoughts and behaviors, traversed by a path of thinking followed by an agent under certain environmental and internal pressures.
Direct guidance: An adaptive navigation technique providing a link to the page that the system determines to be the most suitable next stop along the path to the user's information goal. Usually provided via a "next" button, direct guidance offers a guided tour based on user needs.

Domain model: The component of an adaptive hypermedia application that describes the structure of the information content of the application. The domain model specifies the relationships between the concepts handled by the application and the connections between the concepts and the information fragments and pages.

Dynamic (dynamical) systems theory: The theoretical approach that uses differential equations to describe a multidimensional space of potential thoughts and behaviors, traversed by a path of thinking followed by an agent under certain environmental and internal pressures. Some cognitive scientists view dynamic systems theory as a promising approach to modeling human thinking.

Educational adaptive hypermedia: One of the six major application domains within existing adaptive hypermedia systems. Educational hypermedia constitutes one of the earliest application areas and is still the most widely encountered application domain for adaptive hypermedia systems. Most educational adaptive hypermedia systems limit the size of the hyperspace by focusing on a specific course or topic for learning. User modeling-based adaptive hypermedia techniques are useful in educational hypermedia systems because knowledge level varies widely among users; the knowledge of an individual user can expand very quickly, and novice users need navigational assistance even in a limited hyperspace.

Educational hypermedia: One of the six major application domains within existing adaptive hypermedia systems. Educational hypermedia constitutes one of the earliest application areas and is still the most widely encountered application domain for adaptive hypermedia systems.
Most educational hypermedia systems limit the size of the hyperspace by focusing on a specific course or topic for learning. User modeling-based adaptive hypermedia techniques are useful in educational hypermedia systems because knowledge level varies widely among users; the knowledge of an individual user can expand very quickly, and novice users need navigational assistance even in a limited hyperspace.

Entry point framework: An educational methodology that accommodates individual differences by providing multiple ways to introduce a topic. Although certain entry points activate particular intelligences, a one-to-one correspondence does not exist between entry points and intelligences.

Epistemology: The branch of philosophy that studies knowledge and attempts to answer basic questions about knowledge, such as what distinguishes true or adequate knowledge from false or inadequate knowledge.

Existential/foundational entry point: A way to introduce a topic that allows individuals to approach it by addressing fundamental questions, such as the meaning of life. Philosophical issues invite certain learners to engage on a deep level, which piques and holds their interest in studying a particular topic.

Expert model: An intelligent tutoring system component that represents knowledge in the way that a person skilled in the subject matter would represent it. In recent intelligent tutoring systems, the expert model is a runnable program with the facility to solve problems in the subject matter domain. The expert model is used to compare the learner's solution to the expert's solution; in this way, the intelligent tutor identifies specific points that the learner does not yet understand or topics the learner has not yet mastered.
Explanation variants: A content adaptation method that involves storing variations of sections of information and presenting each individual with the particular variation that best fits the individual's user model.

Formative evaluation: The evaluation of a working prototype or, in some cases, a rough draft of a system.

g factor: The theory that there exists a single, monolithic, and measurable general mental ability in humans, called g.

Generative topics: Topics that are central to one or more disciplines or subjects, accessible and interesting to students, as well as connected to teachers' passions.

Global guidance: A method for adaptive navigation support that helps the user follow the shortest and most direct path to the information goal by telling the user which link to follow next or by sorting the links from a given node according to their relevance to the overall goal.

Global orientation: A method for adaptive navigation support that offers annotation landmarks and hides nonrelevant information so that users understand the structure of, and their position in, hyperspace.

Hands-on entry point: A way to introduce a topic that engages learners in constructing experiments with physical materials or through computer simulations. Other hands-on approaches invite learners to learn by building or manipulating a physical manifestation of some aspect of the topic they are studying.

Hypermedia: Technology that focuses on information nodes and the connections between the nodes.

Instrumentalism: A naturalistic philosophy developed by John Dewey, based on the underlying belief that thought is the product of the interaction between an organism and its environment, with knowledge guiding and controlling that interaction.

Intelligences: Biopsychological potentials for processing information, solving problems, and developing products valued by the culture in which the person resides.
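The "sorting links by relevance" behavior described under global guidance can be sketched as a scoring pass over candidate links. The link data and scoring weights below are invented purely for illustration; a real adaptive hypermedia system would derive them from its user and domain models:

```python
# Hypothetical sketch of adaptive navigation support: rank the links on a
# page by estimated relevance to the user's information goal.

def sort_links_by_relevance(links, user_goal_topics, user_known_topics):
    """Links matching the goal score high; already-known topics score lower."""
    def relevance(link):
        score = 0
        for topic in link["topics"]:
            if topic in user_goal_topics:
                score += 2   # directly serves the information goal
            if topic in user_known_topics:
                score -= 1   # user already knows this; de-prioritize
        return score
    return sorted(links, key=relevance, reverse=True)

links = [
    {"title": "Intro to DNA bases", "topics": {"dna", "basics"}},
    {"title": "Encoding schemes",   "topics": {"dna", "encoding"}},
    {"title": "History of storage", "topics": {"history"}},
]
ranked = sort_links_by_relevance(links,
                                 user_goal_topics={"encoding"},
                                 user_known_topics={"basics"})
```

For this invented user model, the "Encoding schemes" link rises to the top while the already-mastered introduction sinks, which is exactly the adaptation global guidance aims for.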
Intelligent computer-aided instruction (ICAI): Sleeman and Brown (1982) consider intelligent computer-aided instruction to be the same as intelligent tutoring systems.

Intelligent educational systems (IES): Systems that advise learners and treat them as collaborators rather than directing them in an authoritarian manner. IES provide learner models that can be inspected and modified by the learners themselves.

Intelligent tutoring systems: An advanced form of ICAI and CBI that attempts to individualize instruction by creating a computer-based learning environment. The environment performs like a human teacher, working with students to indicate when they make errors, offering suggestions on how best to proceed, recommending new topics to study, and collaborating with students on the curriculum. Such systems should be able to analyze student responses and keep track of the preferences and skills of each learner, customizing materials to fit the needs of individual students.

Interpersonal entry point: A way to introduce a topic that engages learners with each other so that they can interact, cooperate, work together, or alternately debate and argue with each other. Students learn from each other through group projects, in which each student contributes to the overall effort.

Interpersonal intelligence: A biopsychological potential that involves a person's ability to understand the intentions, motivations, and desires of other people and, therefore, to relate effectively with other people.
Intrapersonal intelligence: A biopsychological potential to understand oneself and to construct an effective working model of personal capabilities and difficulties, as well as to employ such knowledge in managing one's life.

Knowledge-based tutoring systems: Systems that incorporate knowledge about the subject matter, principles of teaching, characteristics of individual learners, and human-computer interaction.

Learning management system: An online learning system categorized by function, similar to course management systems, containing content delivery, assessment, and administration functions with an integrated view of all active courses and with assessment and goal-tracking facilities.

Learning objects: Components, lessons, modules, courses, or programs that are individually structured digital or nondigital entities for use or reference in online learning systems.

Legacy systems: Existing applications or systems within an organization that are neither Web-based nor integrated with the Web.

Linguistic intelligence: A biopsychological potential that involves the ability to learn and use spoken and written language to process information and achieve specific goals.

Link adaptation: Adaptive hypermedia techniques that modify the links accessible to a user at a particular time; adaptive navigation.

Local guidance: A method for adaptive navigation support that offers suggestions for the most relevant link to follow for the next step, based on the user's preferences, knowledge, and background.

Local orientation: A method for adaptive navigation support that helps users understand their location in hyperspace and the nearby information, offering information about the nodes available from the current location or limiting navigation possibilities by focusing on the most relevant links.
Logical entry point: A way to introduce a topic that allows learners to deduce the cause and effect of certain occurrences and to apply deductive reasoning to understand the relationships among the various factors involved in the study of a particular topic.

Logical-mathematical intelligence: A biopsychological potential that involves the ability to conduct logical analysis of problems, carry out scientific investigations, and perform mathematical operations.

Logical positivism: A school of thought in philosophy, widely accepted in the early 1950s, that questioned the value of systematic inquiry into the operation of the mind.

Marxism: The philosophy developed by Karl Marx (1818–83) that truth can be discerned by analyzing economic structures.

Modernity: A period during the Enlightenment when the worldview was based on using rational, empirical, and objective approaches to discern the truth.

Multimedia technologies: A number of different media-based technologies that provide delivery services for online learning. These technologies include live streaming video, audio, and slides; on-demand prerecorded video and/or audio with accompanying graphics; browser-based Web conferencing combined with audio conferencing; and interactive graphics, slide shows, audio and video clips, and Web pages.

Multiple representations: An educational methodology used to convey the definitive aspects of an idea or topic by modeling them through abstract or natural representation systems. The form of the representation may be closely tied to the physical subject, such as a photographic record, map, or chart, or may provide a formal model. Contrary to established approaches, Gardner argues for a family of representations rather than a single representation that is considered to be the best. Multiple representations allow students to choose elements from known reference areas to represent and model
the new topic. The use of multiple representations allows students to understand on a deeper level by developing models of the new subject matter.

Musical intelligence: A biopsychological potential that involves the ability to perform, compose, and appreciate musical patterns.

Narrative entry point: A way to introduce a topic that engages students in learning through the relating of stories. Linguistic, intrapersonal, and interpersonal intelligences are activated through verbal storytelling, with additional intelligences activated through symbolic narrative forms, including movies and mime.

Naturalist intelligence: A biopsychological potential that involves the ability to recognize and classify the many species that constitute the flora and fauna of a person's environment.

Neopragmatist: A philosophical approach adopted by Richard Rorty, similar to the pragmatist view, based on the belief that as humans we create ourselves and our worlds, and that human understanding is based on our interpretation of the world through a variety of paradigms rather than on an objective structure of the mind.

Numerical entry point: A way to introduce a topic that offers students who like to deal with numbers and numerical relations the opportunity to learn through measurement, counting, listing, and determining statistical attributes of the topic being studied.

Ongoing assessment: Asks the question: How will you and your students know what they understand? Students reflect on their own learning experiences throughout the process, and there are multiple ways for students to demonstrate to the teacher and to themselves what they understand.

Online learning: Educational technology using computer-mediated communication facilities that generally arise from the use of Internet and Web technology.

Overlay model: The standard type of student model, in which a student's knowledge is considered to be a subset of that of a subject matter expert.
A technique for student modeling that involves measuring the student's performance against the standard of an expert's model.

Page variants: A content adaptation technique of fragment variants in which a fragment is an entire page. Multiple versions of particular pages exist and are selected based on variables in the user model. Users receive structurally different explanations of concepts based on user model attributes. Easy to implement, this technique offers a variant for each user stereotype.

Papert's principle: Papert's belief that major steps in mental growth are based on acquiring new ways to organize and use what a person already knows, not just on learning new skills.

Perceptron: A system invented by Frank Rosenblatt in 1957 through research in connectionism, with which Rosenblatt demonstrated learning by a machine when the Mark I Perceptron "learned" to recognize and identify optical patterns.

Performances of understanding: Asks the question: What will students do to build and demonstrate their understanding? Students can build and demonstrate their understanding through presentations, portfolios, and other approaches that show the teacher and themselves what they have learned.

Postmodernism: A philosophy based on a belief in the plurality of meaning, perspectives, methods, and values, and an appreciation of alternative interpretations. Postmodernists distrust theories that purport to explain why things are the way they are, believing in the existence of multiple truths based on various perspectives and ways of knowing.

Pragmatism: A school of thought developed by William James (1842–1910) and adopted by John Dewey. Dewey then developed a theory of knowledge based on pragmatism that encompassed a view
of the world as one in which active manipulation of the environment is involved throughout the process of learning.

Primacy effect: A psychological effect whereby students are particularly apt to remember the starting point of a learning experience.

Psychoanalytic movement: The school of psychology begun by Sigmund Freud (1856–1939), which seeks to understand an individual's psyche through an examination of the unconscious.

Self-directed learning: Self-paced, asynchronous online learning, with the learner proceeding at his/her own pace through the course materials.

Situated learning: Instruction that places an emphasis on the context in which learning occurs and provides students with opportunities to construct new knowledge and understanding in real-life situations, thereby seeking to avoid the decontextualized nature of typical classroom learning.

Social constructivism: A school of thought that stresses the collaborative efforts of groups of learners as sources of learning and considers the mind to be a distributed entity extending beyond the bounds of the human body into the social environment.

Spatial intelligence: The biopsychological capacity to recognize and manipulate patterns in both wide spaces and confined areas.

Stereotype user model: A model used to represent the user's knowledge, offering a quick assessment of the user's background knowledge. Stereotype user models can be used to classify a new user and initialize the model's state.

Structural linguistics: A model of language developed by Ferdinand de Saussure (1857–1913), based on the belief that meaning comes not from analyzing individual words but from considering the structure of a whole language.

Structuralism: A term credited to anthropologist Claude Levi-Strauss (1908–), who applied models of linguistic structure to the study of the customs and myths of society as a whole.
Believing that individuals do not control the linguistic, sociological, and psychological structures that shape them, and that these structures can be uncovered through systematic investigation, structuralists moved away from the existentialist view that individuals are what they make themselves.

Symbolicism: A school of thought in cognitive science that employs what is now called classicism, using symbolic processing to model thought processes.

Synchronous learning environments: Online learning systems that use audio or video conferencing (or a combination thereof) as their primary delivery modality to support live, simultaneous interaction, similar to an in-person, instructor-led classroom situation.

Teaching for understanding (TfU) framework: An educational methodology designed to assist teachers in course development. The starting point in teaching for understanding is to develop generative topics (topics that are central to the discipline) and understanding goals to provide focus to the instruction.

Theory of multiple intelligences: The cognitive theory, developed by Howard Gardner, that each individual possesses multiple intelligences rather than one single intelligence. Based on evidence from psychology, biology, and anthropology, Gardner delineates criteria used to define eight specific human intelligences: linguistic, logical-mathematical, bodily kinesthetic, interpersonal, intrapersonal, musical, spatial, and naturalist. According to Gardner, these intelligences are both biological and learned or developed. Although everyone possesses these intelligences, individuals differ in which intelligences are more developed than others.
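The overlay model defined earlier in this appendix, a student's knowledge treated as a subset of an expert's, reduces to a set difference. The topic names below are invented for illustration only:

```python
# Hypothetical sketch of an overlay student model: the student's knowledge
# is a subset of the expert's, and the gap drives what the tutor presents next.

EXPERT_MODEL = {"base pairing", "transcription", "translation", "mutation"}

def knowledge_gap(student_model, expert_model=EXPERT_MODEL):
    """Return the topics the student has not yet mastered (expert minus student)."""
    return expert_model - student_model

student = {"base pairing", "transcription"}
gaps = knowledge_gap(student)
# gaps now holds {"translation", "mutation"}: candidate topics to teach next
```

Measuring the student against the expert standard is then just a question of how much of the expert set remains uncovered.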
Thinking style: A preferred way of using one's abilities, according to how the individual likes to do something rather than how well he/she can conduct a task.

Through lines: Ideas that are developed across the curriculum.

Understanding goals: What the teacher wants the students to learn; explicit and public goals that are focused on key concepts, methods, purposes, and forms of expression, as well as linked to assessment criteria.

User-adaptive system: An interactive computer system that adapts itself to current users, employing a user model for adaptation purposes.

User model: A component of an adaptive hypermedia application that represents such individual characteristics as the user's preferences, knowledge, goals, and navigation history, and may include observations of the user's behavior while using the system.

Web-based online learning: Educational technology using computer-mediated communication facilities based on the World Wide Web.
References

Sleeman, D., Brown, J.S., 1982. Intelligent Tutoring Systems. Academic Press, London.
..., 1977. A Structure for Plans and Behavior. American Elsevier, New York.
CHAPTER 4

Hacking DNA genes: the real nightmare
We all understand that the rooster's crow doesn't cause the sun to rise, but even this simple fact cannot easily be translated into a mathematical equation.
– A lecture on Causality by Dr. Judea Pearl
All of this is not to say I am against self-experimentation or treatment. What I am against are biohackers and sketchy companies misleading people into believing they have created cures for diseases or that cures could be created so easily.
– Josiah Zayner, first person to publicly edit his own DNA
If I were a teenager today, I'd be hacking biology. Creating artificial life with DNA synthesis. That's sort of the equivalent of machine-language programming.
– Bill Gates, in an interview with WIRED magazine
A glimpse of the bright future

Here is my wild revelation: in the future, around 2030, there will be hundreds of Internets, existing pretty much like the galaxies in the universe today: interconnected and in constant motion. Our present Internet has grown exponentially and has become bloated with a large quantity of junk big data. Here are some conservative numbers on the "Brownian motion" turmoil of our present cyberworld. The Internet has brought unprecedented change to societies across the world, as some of the eye-popping statistics reported by www.internetlivestats.com show: Google now processes over 40,000 search queries every second, which translates to over 3.5 billion searches per day, or 1.2 trillion searches per year worldwide. Facebook now has over 2 billion monthly active users, and 1.15 billion of them use it every day, whereas Twitter has 328 million active users generating over 500 million tweets every day, or nearly 200 billion a year. There are 1.3 billion YouTube users watching 5 billion videos every single day, and 300 h of video are uploaded every minute of every day.

Tim Berners-Lee's first World Wide Web page flickered to life at CERN on December 20, 1990. It was the very first website to go live. The inaugural page did not become truly public until August 1991, and it was not much more than an explanation of how the hypertext-based project worked. However, it is safe to say that this plain page laid the groundwork for much of the Internet as you know it; even now, you probably know one or two people who still think the Web is the Internet.
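The per-day and per-year figures above follow directly from the quoted per-second and per-day rates; a quick arithmetic check:

```python
# Sanity-checking the scaling of the statistics quoted above.

searches_per_second = 40_000
searches_per_day = searches_per_second * 60 * 60 * 24   # 3,456,000,000 (~3.5 billion/day)
searches_per_year = searches_per_day * 365              # ~1.26 trillion/year

tweets_per_day = 500_000_000
tweets_per_year = tweets_per_day * 365                  # 182,500,000,000 (~200 billion/year)

print(f"{searches_per_day:,} searches/day, {searches_per_year:,} searches/year")
print(f"{tweets_per_year:,} tweets/year")
```

Both quoted figures check out: 40,000 searches/second is about 3.5 billion/day and 1.26 trillion/year, and 500 million tweets/day is about 182.5 billion/year.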
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00004-5 Copyright © 2020 Elsevier Inc. All rights reserved.
A Netcraft web server survey confirmed in October 2014 that an Internet milestone had been reached: 1 billion active websites. Today, that number is 1.3 billion and rising at a rate of nearly 275,000 a day. Demand is ballooning. Today, there are 3.74 billion users plugged in to the Internet worldwide, roughly half the population of the world. The new Internet, let us call it the Neo-Internet, will be fully AI-centric and immersed in nanotechnology, with quantum and DNA computing. It will be better regulated and organized, and its contribution to biomedicine will be tremendous. Many diseases of today will not exist in 2035. We should kiss the hands of the hard-working geneticists who are burning the midnight oil to develop new methods to eliminate mutated genes and replace them with healthy ones. The Neo-Internet will be designed with robust molecular smart digital immune systems to protect it from cyberterrorists, bioterrorists, and physical terrorists. We believe the digital immunity ecosystem will move to a new molecular generation with more predictive and cognizant skills and awesome scalability. Antivirus technologies (AVTs) will become ancestors to remember.
Stuxnet is the devil's key to hell

Before we talk about DNA hacking, it is appropriate to remember the devastation that a piece of software can bring to any country, and how it can tear the country's societal fabric in no time (refer to Appendix A for the complete story). Stuxnet could have been the apocalypse had it not been discovered and stopped. A DNA version of Stuxnet could likewise impact life on Earth and cause misery to all mankind. It is now widely accepted that Stuxnet was created by the intelligence agencies of the United States and Israel. The classified program to develop the worm was given the code name "Operation Olympic Games"; it was begun under President George W. Bush and continued under President Obama. While neither government has ever officially acknowledged developing Stuxnet, a 2011 video created to celebrate the retirement of Israeli Defense Forces head Gabi Ashkenazi listed Stuxnet as one of the successes under his watch.

Stuxnet is an extremely sophisticated computer worm that exploits multiple previously unknown Windows zero-day vulnerabilities, enabling it to infect computers and spread. Its purpose was not just to infect PCs but also to cause real-world physical effects. Specifically, it targeted the centrifuges used to produce the enriched uranium that powers nuclear weapons and reactors. Fig. 4.1 shows the hellish plan to attack the Natanz uranium enrichment plant. The political polarity between the United States and Iran is not justified; nonetheless, the episode offered a glimpse of cyberwar. Several other worms with infection capabilities similar to Stuxnet's, including those dubbed Duqu and Flame, have been identified in the wild, although their purposes are quite different from Stuxnet's. Their similarity to Stuxnet leads experts to believe that they are products of the same development shop, which is apparently still active.
The US and Israeli governments intended Stuxnet as a tool to derail, or at least delay, the Iranian program to develop nuclear weapons, and the Bush and Obama administrations reportedly saw the malware as an alternative to direct military action. Stuxnet was never intended to spread beyond the Iranian nuclear facility at Natanz. The facility was air-gapped and not connected to the Internet. That meant that it had to be infected via USB sticks transported inside by intelligence agents or unwitting dupes, but it also meant the infection should have been easy to contain. However, the malware did end up on Internet-connected computers and began to spread in the wild due to its extremely sophisticated and aggressive nature, although, as noted, it did little damage to the outside computers it infected. Computerworld magazine reported on August 22, 2017, that experts in the United States believed the spread was the result of code modifications made by the Israelis; then-Vice President Biden was said to be particularly upset about this.

FIGURE 4.1
While the individual engineers behind Stuxnet have not been identified, we know that they were very skilled and that there were a lot of them. Kaspersky Lab's Roel Schouwenberg estimated that it took a team of 10 coders 2–3 years to create the worm in its final form. From MERIT CyberSecurity Library.

Liam O'Murchu, from the Security Technology and Response group at Symantec, was on the team that first unraveled Stuxnet. He says that Stuxnet was "by far the most complex piece of code that we've looked at, in a completely different league from anything we'd ever seen before." Many websites claimed to have the Stuxnet code available for download; Symantec did not recommend downloading it. The original source code for Stuxnet, written by US and Israeli intelligence, has not been released. The code for one driver, a very small part of the overall package, has been reconstructed via reverse engineering, but that is not the same as having the original code. However, the code's behavior could be understood by examining the binary in action and reverse engineering it. The Stuxnet code targeted Siemens equipment, and it was a thorough analysis of the code that eventually revealed the purpose of the malware. "We could see in the code that it was looking for eight or ten arrays of 168 frequency converters each," says O'Murchu. "You can read the
Chapter 4 Hacking DNA genes: the real nightmare
International Atomic Energy Agency's documentation online about how to inspect a uranium enrichment facility, and in that documentation they specify exactly what you would see in the facility: how many frequency converters there will be, how many centrifuges there would be. They would be arranged in eight arrays, and there would be 168 centrifuges in each array. That's exactly what we were seeing in the code." "It was very exciting that we'd made this breakthrough," he added. "But then we realized what we had got ourselves into, probably an international espionage operation, and that was quite scary." Symantec released this information in September 2010. It emerged later that the Iranians had been having problems with their centrifuges and that the plant had shut down for a time. Alex Gibney, the Oscar-nominated documentarian behind films like Enron: The Smartest Guys in the Room and Going Clear, directed Zero Days, which recounts the history of Stuxnet's discovery and its impact on relations between Iran and the West. Zero Days includes interviews with O'Murchu and some of his colleagues and is available in full on YouTube. One dramatic sequence shows how the Symantec team managed to drive home Stuxnet's ability to wreak real-world havoc: they programmed a Siemens PLC to inflate a balloon and then infected the PC that controlled it with Stuxnet. The results were dramatic: despite being programmed to inflate the balloon for only 5 s, the controller kept pumping air into it until it burst. The destruction of the Iranian uranium centrifuges followed the same logic: they were spun too quickly and destroyed themselves, which was perhaps less visually exciting but ultimately just as dramatic. As the documentary explains, we now live in a world where computer malware can cause destruction at a physical level. It is inevitable that we will see more of it in the future.
The DNA stuxnet (DNAXNET)

We all know the story of Frankenstein's monster, the fictional character, often erroneously called "Frankenstein" himself, who first appeared in Mary Shelley's 1818 novel Frankenstein; or, The Modern Prometheus. Shelley's title compares the monster's creator, Victor Frankenstein, to the mythological character Prometheus, who fashioned humans out of clay and gave them fire. In Shelley's Gothic story, Victor Frankenstein builds the creature in his laboratory through an ambiguous method combining chemistry and alchemy. Shelley describes the monster as 8 feet (2.4 m) tall and hideously ugly, but sensitive and emotional. The monster attempts to fit into human society but is shunned, which leads him to seek revenge against Frankenstein. DNA can likewise be turned into a formidable weapon of biblical proportions. All it takes is an experienced geneticist, bioinformatician, and computer scientist to write the binary malware payload, convert it into DNA strands, and circulate it around the world through the Dark Web. Let us call it DNAXNET, the new gene editor. Scientists liken gene editing to the Find and Replace feature used to correct misspellings in documents written on a computer. Instead of fixing words, gene editing rewrites DNA, the biological code that makes up the instruction manuals of living organisms. With gene editing, researchers can disable target genes, correct harmful mutations, and change the activity of specific genes in plants and animals, including humans. Much of the excitement around gene editing is fueled by its potential to treat or prevent human diseases. There are thousands of genetic disorders that can be passed on from one generation to the
next; many are serious and debilitating. They are not rare: one in 25 children is born with a genetic disease. Among the most common are cystic fibrosis and muscular dystrophy. Gene editing holds the promise of treating these disorders by rewriting the corrupt DNA in patients' cells. But it can do far more than mend faulty genes. Gene editing has already been used to modify people's immune cells to fight cancer or resist HIV infection. It could also be used to fix defective genes in human embryos and so prevent babies from inheriting serious diseases. The symbolic picture below shows the process of clipping mutated genes and replacing them with healthy ones. Editing embryos is controversial because the genetic changes would affect their sperm or egg cells, meaning the genetic edits, and any bad side effects, could be passed on to future generations. The agricultural industry has leapt on the gene-editing bandwagon for a host of reasons. With gene editing, researchers have made seedless tomatoes, gluten-free wheat, and mushrooms that do not brown with age. Other branches of medicine have also seized on its potential. Companies working on next-generation antibiotics have developed otherwise harmless viruses that find and attack specific strains of bacteria that cause dangerous infections. Meanwhile, researchers are using gene editing to make pig organs safe to transplant into humans. Gene editing has transformed fundamental research, too, allowing scientists to understand precisely how specific genes operate. Most editing tools work by cutting the DNA at a target site; when the cell tries to fix the damage, it often makes a hash of it and effectively disables the gene. This, in itself, is useful for turning off harmful genes. But other kinds of repairs are possible. For example, the most prominent way to edit genes is with the molecular tool CRISPR-Cas9, shown in Fig. 4.2. It uses a guide molecule, gRNA, to find a mutated gene, which is then cut by an enzyme (Cas9).
To mend a faulty gene, scientists can cut the mutated DNA and replace it with a healthy strand that is injected alongside the CRISPR-Cas9 molecules (see source in References). Genes are the biological templates the body uses to make the structural proteins and enzymes needed to build and maintain tissues and organs. Humans have about 20,000 genes bundled into 23 pairs of chromosomes, all coiled up in the nucleus of nearly every cell in the body.
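The Find and Replace analogy can be made concrete with a toy sketch. The 9-letter "faulty" motif and the "healthy" replacement below are invented for illustration; they are not real CRISPR target sites, and real editing operates on molecules, not strings.

```python
# Gene editing as "find and replace": a toy illustration of the analogy.
# The motifs below are invented; they are not real CRISPR target sites.

def edit_sequence(genome: str, faulty: str, healthy: str) -> str:
    """Locate a faulty motif (as gRNA guides Cas9 to a site) and
    splice in a healthy template (as homology-directed repair does)."""
    site = genome.find(faulty)
    if site == -1:
        return genome  # no target found; nothing to cut
    return genome[:site] + healthy + genome[site + len(faulty):]

genome = "ATGGCC" + "TACGTTAGC" + "GGA"   # faulty motif in the middle
edited = edit_sequence(genome, "TACGTTAGC", "TACGATAGC")
print(edited)  # ATGGCCTACGATAGCGGA
```

Unlike text editing, the cell's own repair machinery performs the "paste" step, which is why outcomes are not always as clean as this sketch suggests.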
FIGURE 4.2 The most prominent way to edit genes is with the molecular tool CRISPR-Cas9.
The letters of the genetic code refer to the four base molecules: guanine (G), cytosine (C), thymine (T), and adenine (A). These "base pairs" become the rungs of the familiar DNA double helix. It takes a lot of them to make a gene. The gene damaged in cystic fibrosis contains about 300,000 base pairs, whereas the gene mutated in muscular dystrophy contains about 2.5 million base pairs, making it the largest gene in the human body.
Criminals could alter their DNA to evade justice with new genetic editing tools

Criminals on the run from the law would do anything to hide their identity, including editing their genes and changing the sequence of their base pairs. As reported in The Telegraph and in Science magazine, a revolutionary genetic editing technique designed to repair faulty DNA could be used by criminals to evade justice, experts have said. The Internet has given us some pretty interesting services, such as software-as-a-service (SaaS), platform-as-a-service (PaaS), and crime-as-a-service (CaaS); we can now add DNA-as-a-service (DaaS) to the list. Underground genetic medical clinics will advertise gene scrubbing and editing services on the Deep Web, and the tracking of criminals will come to a screeching halt. Running from justice will become a high-tech challenge. Bonnie and Clyde, the Depression era's Kardashian-style objects of public fascination, would have been the first to use such services.
DNA digital data hacking

Just as Stuxnet was a maliciously political attack, DNA warfare could be used to punish a "whole" nation for fabricated and baseless political and/or religious reasons. DNA hacking malware comes in two distinct flavors. The first is gene hacking, used to modify DNA and create unpredictable Frankensteins, Terminators, and cyborgs. The second is DNA digital data hacking, which we cover in detail in Chapter 12 (DNA Data and Social Crime).
DNA crime

In this chapter, we focus on the first type of DNA crime, gene hacking. The Atlantic, in its November 2012 issue, printed an eye-popping, hair-raising article titled "Hacking the President's DNA." Here are some excerpts from the article: The U.S. government is surreptitiously collecting the DNA of world leaders and is reportedly protecting that of Barack Obama. Decoded, these genetic blueprints could provide compromising information. In the not-too-distant future, they may provide something more as well: the basis for the creation of personalized bioweapons that could take down a president and leave no trace. And according to a 2010 release of secret cables by WikiLeaks, Secretary of State Hillary Clinton directed our embassies to surreptitiously collect DNA samples from foreign heads of state and senior United Nations officials. Clearly, the U.S. sees strategic advantage in knowing the specific biology of world leaders; it would be surprising if other nations didn't feel the same.
DNA, or deoxyribonucleic acid, contains the genetic information that distinguishes each person. The human body contains about 100 × 10^12 cells, each carrying a copy of the DNA; at roughly 6 feet of DNA per cell, that amounts to some 6 × 10^14 feet if stretched end to end.
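A quick arithmetic check of these figures, taking the commonly cited approximations of 100 trillion cells and about 6 feet of DNA per cell:

```python
# Sanity check of the chapter's figures: ~100 trillion cells, each
# holding roughly 6 feet of DNA if uncoiled.
cells = 100e12          # 100 x 10^12 cells
feet_per_cell = 6.0     # approximate length of DNA per cell
total_feet = cells * feet_per_cell
print(f"{total_feet:.0e} feet")          # 6e+14 feet
print(f"{total_feet / 5280:.1e} miles")  # about 1.1e+11 miles
```

That is on the order of a hundred billion miles, far more than the distance from the Earth to the Sun.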
The DNA molecule contains four components: adenine (A), cytosine (C), guanine (G), and thymine (T). DNA can be transferred in body fluids by talking, coughing, and sneezing, and by shedding skin cells. It has steadily become a more important source of physical evidence at crime scenes, especially in cases of sexual assault. Fig. 4.3 shows a simple scenario in which the true perpetrator emerges. Who knows how many people have been incarcerated without true evidence? You can praise DNA or curse it.
FIGURE 4.3 The forensics team collects samples from a crime scene. Suspects are brought into the DNA lab and samples are collected from them. DNA analysis shows that suspect 2 is the true attacker.
Gene hacking

Amazon sells a do-it-yourself (DIY) Bacterial Genome Engineering CRISPR Kit for $169 plus shipping, as shown in Fig. 4.4. It is like giving a child a loaded gun and letting him play in the front yard. The GeneArt Genomic Cleavage Detection Kit is a fast, PCR-based method to measure how well your genome editing tool forms insertion and deletion (indel) mutations. Want to really know what this whole CRISPR thing is about, and why it could revolutionize genetic engineering? This kit includes everything you need to make precision genome edits in bacteria at home, including RNA and DNA templates. The kit is a great temptation for kids to try on friends or neighbors or at school. But bioterrorists who know a lot about genetic engineering and CRISPR could devise diabolical methods far beyond any cyberterror. The history of biological warfare is nearly as old as the history of warfare itself. In ancient times, warring parties poisoned wells or used arrowheads with natural toxins. Mongol invaders catapulted plague victims into besieged cities, probably causing the first great plague epidemic in Europe, and British settlers distributed smallpox-infected blankets to Native Americans. Today, nearly all countries have the technological potential to produce large amounts of pathogenic microorganisms safely. Classical biowarfare agents can also be made far more efficiently than their natural counterparts with even the simplest genetic techniques, and with modern biotechnology it becomes possible to create completely new biological weapons.
FIGURE 4.4 The kit builds linear, double-stranded DNA fragments from synthetic DNA, ready for use in applications such as cloning, CRISPR-based genome editing, and antibody engineering. The box comes with an example experiment, so researchers learn the basics of many molecular biology and gene engineering techniques.
Concerns are also mounting that gene editing could be used in the development of biological weapons. In 2016, Bill Gates remarked that "the next epidemic could originate on the computer screen of a terrorist intent on using genetic engineering to create a synthetic version of the smallpox virus." More recently, in July 2017, John Sotos, of Intel Health and Life Sciences, stated that gene editing research could "open up the potential for bioweapons of unimaginable destructive potential." In 2002, scientists at Stony Brook University recreated the polio virus from scratch based on its published genetic sequence. The demonstration prompted fears that terrorist organizations might exploit the same technique to synthesize deadlier viral agents, such as the smallpox virus, as biological weapons. There is a parallel between the advancement of DNA gene editing and the sophistication of bioterrorism. Just as cyberterrorists have mastered malware technology and developed highly destructive cyberweapons, bioterrorists could be professional genetic engineers hired by adversarial political and religious leaders. A ferocious biowar could take place in synthetic genomics between top-notch genetics gurus with "tacit knowledge" on both sides. Genome synthesis will also be subject to a process of "deskilling," a gradual decline in the amount of tacit knowledge required to master the technology, which will eventually make it accessible to novice criminals with malicious intent. This debate is of more than academic interest because it is central to determining the security risks associated with the rapid progress of biological science and technology.
Let us remember how cyberterrorism has beaten antivirus technology in sophistication and secrecy, and let us not forget how the US presidential election of 2016 was compromised by foreign adversaries. Deskilling has already occurred in several genetic engineering techniques that have been around for more than 20 years. In fact, a few standard genetic engineering techniques have been deskilled to the point that they are now accessible to undergraduates and even advanced high school students, and could therefore be appropriated easily by terrorist groups. In May 2008, a group of amateur biologists in Cambridge, Massachusetts, launched another open-access initiative called "DIYbio" (do-it-yourself biology) with the goal of making biotechnology more accessible to nonexperts, including the potential use of synthetic biology techniques to carry out personal projects. DIYbio has since expanded to other US cities as well as internationally, with local chapters in Bangalore, London, Madrid, and Singapore. Although the group's technical infrastructure and capabilities are still rudimentary, they may become more sophisticated as gene synthesis technology matures. Malware history will repeat itself. The parallelism syndrome will prevail, and gene editing tools will land in the hands of "black-hatted" computer hackers, who will morph into "biohackers" to demonstrate their technical prowess in the synthesis and weaponization of bioviruses, bioworms, and bio-Trojans, creating a new mélange of nanoevil that will wreak havoc on any occasion. Within a few years, malware immersed in DNA will be a major challenge for chief security officers (CSOs) and antivirus technology (AVT) vendors. DNA hacking will be hard to trace and eliminate. Here are some interesting stories on how "techno DNA hacking" has become the Shangri-La of academia.
The biological malware was created by researchers led by Tadayoshi Kohno, who call it the first "DNA-based exploit of a computer system." To carry out the hack, Kohno and his team encoded malicious software in a short stretch of DNA they purchased online. They then used it to gain "full control" over a computer that tried to process the genetic data after it was read by a DNA sequencing machine. To make the malware, the team translated a simple computer command into a short stretch of 176 DNA letters, denoted as A, G, C, and T. After ordering copies of the DNA from a vendor for $89, they fed the strands to a sequencing machine, which read off the letters and stored them as binary digits, 0s and 1s. The attack took advantage of a buffer overflow, in which data that exceed a storage buffer can be interpreted as a computer command. In this case, the command contacted a server controlled by Kohno's team, from which they took control of the computer in their lab being used to analyze the DNA file.
MyHeritage website leakage

Founded in Israel in 2003, the site launched a service called MyHeritage DNA in 2016 that, like competitors Ancestry.com and 23andMe, lets users send in a saliva sample for genetic analysis. The website currently has 96 million users; 1.4 million of them have taken the DNA test. In December 2017, a MyHeritage security officer received a message from a researcher who had unearthed a file named "MyHeritage," containing the email addresses and hashed passwords of 92,283,889 of its users, on a private server outside the company. MyHeritage said that email addresses and password information linked to more than 92 million user accounts had been compromised in an apparent hacking incident. "There has been no evidence that the data in the file was ever used by the perpetrators," the company said in a statement a couple of days later.
According to the MyHeritage website, the breach took place on October 26, 2017, and affects users who signed up for an account through that date. The company said that it does not store actual user passwords, but instead passwords transformed with what is called a one-way hash (the output is called a digest), with a different key used for each customer's data. Experienced hackers could still crack the hashed passwords into legible passwords by guessing candidates and hashing each one, and could then access personal information, such as the identity of family members, when logging into someone's account. However, hackers could not easily access raw genetic information, since a step in the download process includes email confirmation. The company emphasized that DNA data are stored "on segregated systems and are separate from those that store the email addresses, and they include added layers of security." So why would hackers want DNA information specifically? And what are the implications of a big DNA breach? One simple reason is that hackers might want to sell DNA data back for ransom, says Giovanni Vigna, a professor of computer science at UC Santa Barbara and cofounder of the cybersecurity company Lastline. Hackers could threaten to revoke access or post the sensitive information online if not given money; one Indiana hospital paid $55,000 to hackers for this very reason. But there are reasons genetic data specifically could be lucrative. "This data could be sold on the down-low or monetized to insurance companies," Vigna adds. "You can imagine the consequences: One day, I might apply for a long-term loan and get rejected because deep in the corporate system, there is data that I am very likely to get Alzheimer's and die before I would repay the loan." Genetic testing sites are treasure troves of sensitive information. Some sites offer users the option to download a copy of their full genetic code, whereas others do not.
But the full genetic code is not the most valuable information anyway. We cannot just read genetic code like a book to gain insights. Instead, it is the easy-to-access account pages with health interpretations that are most useful to hackers. MyHeritage does not offer health or medical tests, but many companies, such as 23andMe and Helix, do. And there are plenty of players interested in DNA: researchers want genetic data for scientific studies, insurance companies want genetic data to help them calculate the cost of health and life insurance, and police want genetic data to help them track down criminals, as in the recent Golden State Killer case. We already lack robust protections when it comes to genetic privacy, so a genetic data breach could be a nightmare. "If there is data that exists, there is a way for it to be exploited," says Natalie Ram, a professor of law focusing on bioethics issues at the University of Baltimore.
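The per-customer one-way hashing described earlier in this section can be sketched generically. MyHeritage has not published its exact scheme, so this is an illustration of the general technique, not their code.

```python
# Generic illustration of one-way password hashing with a per-user salt.
# MyHeritage's actual scheme is not public; this is not their code.
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest). The salt is unique per account, so two
    users with the same password still get different digests."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify(password, salt, digest):
    # Constant-time comparison avoids leaking timing information.
    return hmac.compare_digest(hash_password(password, salt)[1], digest)

salt, digest = hash_password("correct horse battery staple")
assert verify("correct horse battery staple", salt, digest)
assert not verify("guess", salt, digest)
```

This is why stolen digests cannot simply be "decrypted": an attacker must guess candidate passwords and hash each one, and the per-user salt prevents a single precomputed table from cracking all accounts at once.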
Appendices

Appendix 4.A The complete story of Stuxnet

The Stuxnet strategy of attack

Stuxnet attacked Windows systems using an unprecedented four zero-day exploits. It initially spread using infected USB flash drives and then used other exploits and techniques, such as peer-to-peer remote procedure call (RPC), to infect and update other computers inside private networks that are not directly connected to the Internet. Stuxnet is unusually large at half a megabyte and was written in several different programming languages (including C and C++), which is also irregular for malware.
Two attack strategies

Increase the rotor speed: The virus worked by first causing an infected Iranian IR-1 centrifuge to increase from its normal operating speed of 1064 Hz to 1410 Hz for 15 min before returning to its normal frequency. Twenty-seven days later, the virus went back into action, slowing the infected centrifuges down to a few hundred hertz for a full 50 min. The stresses from the excessive, then slower, speeds caused the aluminum centrifuge tubes to expand, often forcing parts of the centrifuges into enough contact with each other to destroy the machine.

Disturb the gas pressure: After painstaking analysis, we can now confirm that the 417 programmable logic controller (PLC) attack code modifies the state of the valves used to feed UF6 (uranium hexafluoride gas) into the uranium enrichment centrifuges. The attack essentially closes the valves, causing disruption to the flow and possibly destruction of the centrifuges and related systems. In addition, the code takes snapshots of the normal running state of the system and then replays those normal operating values during an attack so that the operators are unaware that the system is not operating normally. It also prevents modification of the valve states in case an operator tries to change any settings during the course of an attack cycle. Natural uranium consists of three isotopes; the majority (99.274%) is U-238, approximately 0.72% is fissile U-235, and the remaining 0.0055% is U-234. If natural uranium is enriched to contain 3% U-235, it can be used as fuel for light water nuclear reactors. If it is enriched to contain 90% U-235, it can be used for nuclear weapons. The worm propagates across the network, scanning for Siemens Step 7 software on computers controlling a PLC. If either condition is missing, Stuxnet lies dormant inside the computer.
If both conditions are fulfilled, Stuxnet introduces its rootkit onto the PLC and into the Step 7 software, modifying the code and giving unexpected commands to the PLC while returning a loop of normal operating values to the users. Here is a little more detail for our advanced readers. Once Stuxnet is running without the user's knowledge, it can move on to its targets. The Windows machines in the Supervisory Control and Data Acquisition (SCADA) system communicate with the PLCs by way of a program called WinCC/PS7, or Step 7. This program essentially translates user commands into useful commands for the PLCs using a set of libraries. When Stuxnet is installed, it targets one of these libraries (s7otbxdx.dll), which contains the translations for reading and writing new processes for the PLC, among other things. It takes advantage of a zero-day exploit in the WinCC database (a back-door password that ships with the software) to give itself access to the database's libraries. Stuxnet renames s7otbxdx.dll to s7otbxsx.dll and replaces the original library with a modified version. This modified library contains almost everything present in the original, but some commands are intentionally translated incorrectly. Fig. 4.5 shows how the original Step 7 code is attacked and replaced with Stuxnet code. Now, U-235 is the only variety, or isotope, of uranium that goes bang; in other words, only U-235 can be used in a nuclear weapon or nuclear reactor. But U-235 and U-238 are chemically identical.
FIGURE 4.5 The Step 7 software was compromised by replacing the original Windows control file, s7otbxdx.dll (a Dynamic Link Library), with one loaded with Stuxnet malware code. The original control file was deactivated, and Stuxnet took over the operation of the centrifuges. From MERIT CyberSecurity Library.
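The library man-in-the-middle pattern that Fig. 4.5 depicts can be sketched conceptually. The real attack was native Windows code replacing a Siemens DLL; the Python classes below are illustrative stand-ins showing only the pattern: pass most calls through, silently rewrite the ones that matter, and report recorded "normal" values back so operators see nothing wrong.

```python
# Conceptual sketch of the Step 7 library hijack: a trojan layer sits
# between operator software and the PLC. The class and value names are
# illustrative stand-ins, not Stuxnet's actual interfaces.

class GenuineDriver:                       # plays the role of s7otbxsx.dll
    def __init__(self):
        self.frequency = 1064              # Hz, normal IR-1 operating speed
    def write_frequency(self, hz):
        self.frequency = hz
    def read_frequency(self):
        return self.frequency

class TrojanDriver:                        # plays the role of the fake s7otbxdx.dll
    def __init__(self, real):
        self._real = real
        self._recorded_normal = real.read_frequency()   # snapshot normal state
        self._attacking = False
    def start_attack(self, hz):
        self._attacking = True
        self._real.write_frequency(hz)     # hardware receives the bad value
    def write_frequency(self, hz):
        if self._attacking:
            return                         # operator corrections silently dropped
        self._real.write_frequency(hz)
    def read_frequency(self):
        if self._attacking:
            return self._recorded_normal   # replay normal readings to the HMI
        return self._real.read_frequency()

plc = GenuineDriver()
driver = TrojanDriver(plc)
driver.start_attack(1410)
print(driver.read_frequency(), plc.read_frequency())   # 1064 1410
```

The operator's screen reports 1064 Hz while the hardware actually runs at 1410 Hz, which is exactly the deception Natanz's engineers faced.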
How do you separate them? You combine uranium with the incredibly reactive gas fluorine to make a new gas, uranium hexafluoride. Then you pump this radioactive gas into a centrifuge and spin for week after week after week. Ever so slowly, the gas with the heavier U-238 gradually gets thrown to the outside wall of the spinning centrifuge, whereas the gas with the not-so-heavy U-235 is left behind in the center. But this central gas still has a lot of the heavier U-238 gas mixed in, so you take out that central gas and feed it into another centrifuge, spin it for a few more weeks or months, and repeat, over and over. Of course, to be really efficient, a centrifuge has to spin very quickly, so quickly that it is only a few percent away from self-destruction. The underground uranium centrifuge plant at Natanz in Iran had some 6000 centrifuges up and spinning by April 2008.
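The spin-extract-refeed cycle described above can be caricatured as a loop: each stage leaves the central gas slightly richer in U-235, and its output is fed to the next stage. The per-stage separation factor below is an illustrative assumption for a strongly idealized stage, not a real IR-1 figure; real cascades also have far more complex plumbing.

```python
# Caricature of cascade enrichment: repeatedly boost the U-235 fraction
# and feed the product to the next stage. The separation factor is an
# illustrative assumption, not a real centrifuge specification.

def enrich(fraction_u235, separation_factor=1.3):
    """One idealized stage: multiply the U-235/U-238 abundance ratio."""
    ratio = fraction_u235 / (1.0 - fraction_u235)
    ratio *= separation_factor
    return ratio / (1.0 + ratio)

fraction = 0.0072                  # natural uranium, ~0.72% U-235
stages = 0
while fraction < 0.90:             # weapons grade needs ~90% U-235
    fraction = enrich(fraction)
    stages += 1
print(stages)                      # 28 idealized stages under these assumptions
```

Even in this generous caricature, dozens of passes are needed, which is why real plants run thousands of centrifuges for months, and why wrecking them sets a program back by years.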
The crack in the door

The Iranian Natanz centrifuge plant was not connected to the Internet. So the makers of Stuxnet instead sent their virus, via the Internet, to five specific domains in Iran that were somehow associated with the Natanz centrifuge plant. We still don't know who or what these domains were, whether they supplied hardware or software, or whether they were contractors such as electricians, plumbers,
network specialists, and so on. But we do know that the Stuxnet virus entered the computers at these domains with no trouble at all, because it had four zero-day vulnerabilities and two genuine, but stolen, digital certificates. One or more of the workers at one or more of these domains then broke the air-gap rule: he or she took a USB memory stick that had been used at the domain into the uranium centrifuge plant and used it in a computer there. Stuxnet was now inside the computers that controlled the ultrahigh-speed centrifuges, which, in turn, separated out the desirable U-235, the only isotope of uranium used in nuclear weapons or power plants. Once inside, Stuxnet made copies of itself and spread to all the computers on the internal network at the underground centrifuge plant. Then the virus went deeper and looked for motors spinning at 1064 revolutions per second (the normal rate). It sped up the uranium centrifuges from their normal 1064 revolutions per second to 1410 and kept them there for 15 min. This burst of overspeed created subtle damage in the bearings and structures of the centrifuges, which were already running at a speed close to critical. Over time, this damage would destroy the centrifuges. Then Stuxnet went to sleep for 27 days. When it woke up, it slowed the centrifuges down to an incredibly slow two revolutions per second and kept them at that speed for 50 min. The security of PLCs has always been very low, because nobody thought they would be a target. But what if every traffic light, elevator, and water pump in your country suddenly stopped working? What if the tiny valves that let fuel into every gasoline and diesel engine suddenly stopped working? The United States' and Israel's execution of this plan was astounding.
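The attack sequence described above reduces to a simple timeline. The speeds and timings follow the chapter's account (15 minutes of overspeed, a 27-day sleep, 50 minutes at crawl speed); the table-driven scheduling code itself is illustrative, not recovered Stuxnet logic.

```python
# Timeline sketch of the overspeed/dormancy/crawl cycle, with the
# spoofed readings shown to operators. Values follow the chapter's
# account; the code structure is illustrative.

NORMAL, OVERSPEED, CRAWL = 1064, 1410, 2   # revolutions per second

attack_plan = [
    # (phase duration,     commanded speed, speed reported to operators)
    ("15 minutes",         OVERSPEED,       NORMAL),  # stress bearings
    ("27 days (dormant)",  NORMAL,          NORMAL),  # lie low
    ("50 minutes",         CRAWL,           NORMAL),  # stress on slowdown
]

for duration, commanded, reported in attack_plan:
    spoofed = " (spoofed)" if commanded != reported else ""
    print(f"{duration:>18}: drive at {commanded:>4} rps, "
          f"operators see {reported} rps{spoofed}")
```

The long dormant phase is what made the sabotage look like chronic mechanical unreliability rather than an attack.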
Stuxnet did fail to satisfy its own standards of success in one regard: according to Sanger, the worm was never intended to travel outside Natanz's isolated, air-gapped networks, but an error in the code caused it to replicate itself and spread when an Iranian technician connected an infected laptop computer to the Internet. Fortunately, the worm did not cause widespread damage, because it was engineered to affect Iranian enrichment facilities only. Stuxnet 0.5 tampered with the valves that fed uranium hexafluoride into centrifuge groupings. By triggering the valves to open and close prematurely, it changed the pressure, which in turn caused the gas to solidify and thus destroy the centrifuges and the sensitive equipment used to develop them.
Damages

Stuxnet 1.0 was reported to have infected an estimated 100,000 computers, many of which were not even involved with the uranium enrichment program. The Natanz uranium enrichment plant had approximately 10,000 IR-1 centrifuges (based on the Pakistani centrifuge called the P-1); Stuxnet decommissioned 1000 of them by introducing code that made the centrifuges run beyond the red line. Fig. 4.6 shows the configuration of the Natanz uranium plant. The two PLCs control and monitor the flow of data and heavy water in the reactors.
FIGURE 4.6 Stuxnet 1.0, the later version, took on a completely different strategy. This time the virus interfered with the computerized frequency converters that controlled the speed of the centrifuges during the enrichment process. By driving the centrifuges to both extremes of speed, it permanently damaged key parts of the enrichment process. From MERIT CyberSecurity Library.
Historical background

On May 9, 1979, a prominent Iranian Jew, Habib Elghanian, was executed by the new Islamic government shortly after the revolution. He was known throughout Iran at one time as the largest producer of plastic goods. He was a successful importer and industrialist with several manufacturing companies, the wealthiest Jew in Iran and his community's leader, as well as a generous philanthropist to Iranians of all religions. Menachem Begin, then Prime Minister of Israel and former head of the Zionist militant group Irgun, took a solemn oath to avenge Elghanian's blood. In May 2009, two representatives of the Mossad visited the Elghanian family in Glendale, California, and delivered documents stating that the first Stuxnet would be in honor of Habib Elghanian. The number 19790509, in hexadecimal coding, was the password that triggered the first step of the infection routine in Stuxnet. June 4, 2008 was the kick-off date of the Stuxnet attack. It took two years to fabricate in Israel with the blessing of the CIA and the Institute of Terrorism in Israel. According to reliable news coming from the Institute of Terrorism in Philadelphia and Jerusalem, Stuxnet was designed by a top-notch team of four Israeli cryptographers, three Iranian nuclear engineers, two German process engineers, five US cyber experts, and three US and one Israeli experts in intelligence and antiterrorism (a total of 18 professionals).
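The number mentioned above reads as a date when its hexadecimal digits are taken literally, which is the standard interpretation in public Stuxnet analyses (where 0x19790509 appears as an infection-marker value in the code):

```python
# The Stuxnet marker value 0x19790509, read digit-for-digit as a date:
# May 9, 1979, the date of Habib Elghanian's execution.
from datetime import date

marker = 0x19790509
digits = f"{marker:08x}"                          # "19790509"
d = date(int(digits[:4]), int(digits[4:6]), int(digits[6:]))
print(d.isoformat())                               # 1979-05-09
```

Note that as an ordinary integer the value is meaningless; the message only appears when the hex digits are read as YYYYMMDD.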
According to the Jerusalem Post, the German security expert Ralph Langner reported in his analysis that Stuxnet contained two distinct "digital warheads," specifically designed to attack the uranium enrichment plant at Natanz and the nuclear power plant in Bushehr. The Pakistani newspaper The Nation also mentioned that "Stuxnet was assembled by a highly qualified team of experts, involving some with specific control system expertise." The team managed to get detailed information about SCADA from Siemens by approaching the company as engineers from Libya (which was very friendly with Iran). Two members of the team claimed that they had permission to visit the Bushehr plant, and they secretly obtained a complete set of the engineering drawings of the plant. Afterward, three Iranian engineers were executed as spies. The team then succeeded in establishing a reliable contact inside Iran at the uranium enrichment plant. The team worked in Israel testing Stuxnet for 18 months before they were able to hand the loaded USB to the agent inside Iran.
Appendix 4.B Glossary (extracted from MERIT CyberSecurity archives)

Biohacking: A certain type of do-it-yourself science; it has nothing to do with hacking computers. Biohackers are people with a strong interest in biology and biological processes who carry out experiments and do research based on scientific methods, but not within the framework of traditional scientific institutions, universities, and companies. Biohackers build their own laboratories at home or at so-called hackerspaces. They are interested in understanding their surrounding environment better and in finding new perspectives on scientific issues. Many biohackers are connected in a global network as well as in smaller, local groups. The biggest DIY bio platforms are diybio.org and sphere.diybio.org. However, biohacking is highly controversial: experiments involving one’s own body, experimental substances on the body, or genetic modification (e.g., of pathogenic bacteria) are viewed critically.

Center for citizen science: Acts as an information and service point for researchers, citizens, and experts from different disciplines and cross-links the interested community within Austria and beyond.

Citizen science: Describes the engagement in scientific processes of people who are not tied to institutions in that field of science. Participation can range from the short-term collection of data to the intensive use of leisure time to delve deeper into a research topic together with scientists and/or other volunteers. Although many volunteer scientists do have a university degree, this is not a prerequisite for participating in research projects. However, it is important that scientific standards are adhered to. This pertains especially to transparency with regard to the data collection methodology and the open discussion of the results (Green Paper: Citizen Science Strategy 2020 for Germany, GEWISS, 2016).
By means of activities such as watching butterflies, photographing landscapes, or transcribing old archive documents, citizens can create new knowledge and contribute actively to shaping research.

Citizen science award: Has been awarded by the Federal Ministry of Education, Science, and Research since 2015. It distinguishes excellent achievements of citizens who participate in selected Austrian research projects during a clearly defined period of time. In the first year, participation was reserved for schools. As of 2016, interested people of all age groups have been invited to take part in the projects.

Cocreation: Signifies the joint efforts of researchers and citizens to develop and conduct a research project together. From the research question and the selection of a methodology to the collection, analysis, and interpretation of data, every step of the research process is done in collaboration. In citizen science, different typologies exist. In a publication by R. Bonney and colleagues (2009), three types of projects were described: contributory, collaborative, and cocreative.
Chapter 4 Hacking DNA genes - the real nightmare
Community science: Research projects that are initiated by citizens, for example, to examine the water or air quality in their residential environment by means of scientific methods. In London, for example, locals started the (already completed) project “Royal Dock Noise Mapping” to measure the noise level in their residential environment. The background of this initiative was the fact that, due to the proximity of the airport, people were very concerned about the increasing air traffic and noise and asked researchers for help.

Crowdfunding: By means of crowdfunding, volunteers can support the development and implementation of scientific projects. This type of financing is also used in other sectors, for example, in music, film, literature, and/or IT. A successful platform in the field of research is Sciencestarter.

Crowdsourcing: The name for a method where internal subtasks or problems are outsourced to interested persons. In this way, the creativity/knowledge of the “crowd” and the “wisdom of the crowd” are used to find quicker and better solutions. Crowdsourcing can be used in citizen science and open innovation. One example of a platform that invites the Internet community to submit proposals for solutions is InnoCentive.

DIY science: Do-it-yourself science refers to various grassroots activities that are embedded in the bottom-up movement. DIY projects use scientific methods but are not integrated into a traditional research institution, university, or company. They are conducted by associations, civil society organizations, or individuals. Biohacking is the most popular kind of DIY science. People with a strong interest in biology carry out experiments in their own laboratories to better understand their surrounding environment. The Manchester Digital Laboratory, Waag, and the Open bioLab Graz Austria (OLGA) are examples of biohacking grassroots activities.
An interesting interview with biologist, science hacker, and community manager Lucy Patterson on this topic can be found at https://www.zentrumfuercitizenscience.at/en/glossary.

Open access: Free-of-charge, open access to digital scientific contents and information on the Internet, including free access to scientific literature and data.

Open innovation: The opening up of rigid processes to input from outside; also regarded as an umbrella term for innovation and the gain of knowledge in the fields of technology and fundamental research. In contrast to citizen science, open innovation does not focus on citizens but on processes. The term originally comes from the corporate sector.

Open innovation strategy: After a year-long process involving the public and stakeholders, a new open innovation strategy for Austria was introduced in the summer of 2016. It presents a vision for 2025 with three fields of action and 14 specific measures.

Open science: Frequently defined as an umbrella term covering various movements that aim to remove the barriers to sharing any kind of output, resource, method, or tool at any stage of the research process. As such, open access to publications, open research data, open source software, open collaboration, open peer review, open notebooks, open educational resources, open monographs, citizen science, and research crowdfunding all fall within the boundaries of open science (Gema Bueno de la Fuente, n.d.).

Responsible science: Also known as responsible research and innovation (RRI) in the EU context, it actively involves civil society in research and innovation processes to handle current challenges more effectively and in accordance with the values, expectations, and needs of society. Austria, too, has included responsible science as an important element in the “Action plan for a competitive research area” of the Federal Ministry of Science, Research, and Economy (BMWFW). One of the first steps derived from this is the establishment of an Alliance for Responsible Science, which numerous institutions from science, research, education, and practice have already joined.
Sparkling science: An Austrian research program in which pupils are directly involved in current research. Since the beginning of the program in 2007, the integration of schools as fixed consortium partners in all project teams has been a prerequisite for the support of research work; since 2014, the participation of interested schools from the whole of Austria in selected citizen science pilot projects has also been supported.

Storytelling: A method to communicate better by imparting experiences in the form of stories with specific characters and a clear plot. This method is especially suitable if you want to convey a specific topic, such as citizen science, to a heterogeneous audience or community in a comprehensible way. Storytelling can be practiced in workshops.

Volunteered computing: In the case of volunteered computing, people put the idle processor power of their computers, laptops, or smartphones at the disposal of researchers. It can be used by research projects that require a lot of computing power, by means of special software (e.g., BOINC).

Young Science Center: The Young Science Center for cooperation between science and schools offers all Austrian schools and research institutions a great number of possibilities to make contact with one another and to work together. The bundling of offers, projects, and contacts creates synergy between science and schools. Moreover, an active network of educational and research institutions is created. The central hub of the center is the Internet platform (www.youngscience.at).
Suggested readings

Could terrorists exploit synthetic biology? https://www.thenewatlantis.com/publications/could-terrorists-exploit-synthetic-biology.
Biohackers: https://www.wired.com/story/malware-dna-hack/.
Citizen science strategy: https://www.researchgate.net/./303804239_Green_Paper_Citizen_Science_Strategy_..
Criminals can alter their DNA: https://www.telegraph.co.uk/science/2018/05/05/criminals-could-alter-dna-evade-justice-new-genetic-editing/.
CRISPR immune system (Elsevier): https://www.sciencedirect.com/science/article/pii/S0300908415001042.
David Wall, “Cybercrime”, Polity Press, 2007.
DNA hacking: https://www.telegraph.co.uk/science/2018/05/05/criminals-could-alter-dna-evade-justice-new-genetic-editing/.
Eric Topol, “Deep Medicine”, Basic Books, 2019.
Gema Bueno de la Fuente: http://learn-rdm.eu/en/author/gema/.
Genetic engineering and biological weapons: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1326447/.
Global hacking: https://www.globalresearch.ca/the-truth-about-the-global-hacking-industry/5620279.
Golden State Killer, Joseph DeAngelo: https://www.nbcnews.com/news/us-news/dna-clears-accused-golden-state-killer-joseph-deangelo-1975-murder-n956566.
Hackers and DNA tricks: https://www.nature.com/articles/d41586-018-05769-8.
Hacking the president’s DNA: https://www.theatlantic.com/magazine/archive/2012/11/hacking-the-presidents-dna/309147/?single_page=true.
Hacking the President’s DNA: https://www.theatlantic.com/magazine/archive/2012/11/a-brief-history-of-brave-thinking/309137/.
John Soto: http://theconversation.com/could-gene-editing-tools-such-as-crispr-be-used-as-a-biological-weapon-82187.
Michio Kaku, “The Future of Humanity”, Doubleday Publishing, 2018.
Nikhil Buduma, “Fundamentals of Deep Learning”, O’Reilly Media, Inc., 2017.
R. Bonney: https://www.researchgate.net/publication/220879569_Dynamic_Changes_in_Motivation_in_Collaborative_Citizen-Science_Projects.
Responder Professional Malware Analyzer: http://www.datadev.com/responderpro.html?viewfullsite=1.
Zoya Ignatova, Karl-Heinz Zimmermann, “DNA Computing Models”, Springer, 2008.
CHAPTER 5

The digital universe with DNA - the magic of CRISPR
DNA neither cares nor knows. DNA just is. And we dance to its music. - Richard Dawkins.
Of course, we’re all programmed genetically to some extent. But the “selfish gene” thesis doesn’t explain everything. - Jane Goodall, English primatologist and anthropologist.
For your own information

DNA is not “coded”; DNA is the “code.” Think of a DNA molecule as a digital memory device, just like a USB drive. DNA is an incredible molecule that can be used to store binary numbers (bits, zeros and ones) sequentially, instead of just storing genes. DNA stores a sequence of digits, but instead of a pure binary code (base 2), it uses a quaternary code (base 4). Here are six weird but true facts about DNA.
1. Your DNA could stretch from the earth to the sun and back approximately 600 times. If unwound and linked together, the strands of DNA in each of your cells would be six feet long. With 100 trillion cells in your body, if all your DNA were put end to end, it would stretch over 110 billion miles. That is hundreds of round trips to the sun!
2. We are all 99.9% alike. Of the 3 billion base pairs in the human genome, only 0.1% is unique to us. While that 0.1% is still what makes us unique, it means we are all more similar than we are different.
3. Genes make up only about 3% of your DNA. Genes are short segments of DNA, but not all DNA consists of genes. All told, genes are only about 1%-3% of your DNA. The rest of your DNA controls the activity of your genes.
4. A DNA test can reveal you are more Irish than your siblings. Your sister could be much more Irish than you. And this is true for any of the over 350 regions covered by the AncestryDNA test. So your sibling could also be more (or less) British, Nigerian, or Scandinavian than you.
5. The human genome contains 3 billion base pairs of DNA. DNA molecules are shaped like twisted ladders. The rungs on that ladder are made of bases (adenine (A), cytosine (C), guanine (G), and thymine (T)) locked together in pairs with hydrogen bonds. The really cool part is that they pair up in a very specific way: A always pairs with T, and C always pairs with G.
6. Your DNA could link you to places you would never imagine. Genetics has the power to tell you things you never dreamed of knowing, from just the DNA in your saliva.
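The quaternary (base-4) storage idea described above can be sketched in a few lines of code: each DNA base carries two bits. The particular bit-to-base mapping below (00 to A, 01 to C, 10 to G, 11 to T) is an illustrative assumption, not a published encoding scheme; real DNA storage codecs also add error correction and avoid troublesome sequences such as long runs of the same base.

```python
# A minimal sketch of two-bits-per-base DNA encoding.
# The mapping 00->A, 01->C, 10->G, 11->T is assumed for illustration.
BIT_PAIRS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASES = {base: bits for bits, base in BIT_PAIRS.items()}

def bytes_to_dna(data: bytes) -> str:
    """Encode raw bytes as a string of nucleotides (4 bases per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_PAIRS[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(seq: str) -> bytes:
    """Decode a nucleotide string back to the original bytes."""
    bits = "".join(BASES[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

encoded = bytes_to_dna(b"Hi")
print(encoded)  # each byte becomes four bases
assert dna_to_bytes(encoded) == b"Hi"  # lossless round trip
```

Because one base holds two bits, a sequence of n bases stores n/4 bytes, which is the density advantage the text alludes to.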
What is our digital universe?

Merriam-Webster offers the following definitions of “universe”:
• The whole body of things and phenomena observed or postulated.
• A systematic whole held to arise by and persist through the direct intervention of divine power.
• The world of human experience.
• All the galaxies and the aggregate of stars.

We offer a complementary description of our “digital universe”:
• All human activities that interface with electronic machines, devices, motors, sensors, computing boxes, mobile devices, and medical implants.
• All the content created by human activities and by intelligent, smart machines, including security cameras and sensors.
• All the information that goes into processing, and all the information that comes out of processing.
• All transient unstored information, temporarily stored information, and permanently archived information.

Human activities transform into information, with one awakening reality: data storage will suffocate from lack of space. Fig. 5.1 represents the suggested ecosystem of our simple digital universe.
FIGURE 5.1 This is a simple model of our digital universe as it processes information in nine sequential steps, indicated by the numerical sequence. The exponential gluttony of data will bring major changes (some good and some bad) to our societal landscape. As our digital universe continues to expand, three fundamental needs will emerge: information security, sustainability, and greater storage capacity and durability.
How big is our digital universe?

No one really knows, with hard, concrete evidence, the size of our celestial universe or the number of galaxies and stars. The same is true of our digital universe. Measuring our digital universe is a great challenge: we have no standards or metrics to apply. But we do have some credible, educated estimates that we use in this book. We appreciate the figures provided by International Data Corporation (IDC), Gartner, and Forrester and thank them for their gracious contribution. Our digital universe is a moving target, expanding and accelerating dynamically in every direction every second. We are immersed in it and have developed a dreadful affinity to be part of it. Fig. 5.2 shows the asymptotic explosion of information consumption by the population of the world. Theoretically, a digital universe can be represented as a stretched polynomial with hundreds of unknown, or fuzzy, variables. With the rise of Big Data awareness and analytics technology, the digital universe lives increasingly in computing clouds, above a terra firma of vast hardware data centers linked to billions of distributed devices, all governed and defined by increasingly intelligent software. In December 2012, the research firm IDC published its sixth annual study, titled “The Digital Universe in 2020,” which includes interesting findings:
- From 2005 to 2020, the digital universe will grow by a factor of about 300, from 130 exabytes to 40,000 exabytes (1 exabyte = 10^18 bytes). From now until 2020, the digital universe will roughly double every 2 years.
FIGURE 5.2 3.7 billion users are consuming incredible amounts of CPU cycles to get the right information at the right time and in the right format, but the nonlinearity of storage will soon reach its own asymptote. The DNA storage paradigm will create a seismic change in the near future.
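The IDC figures quoted above (130 exabytes growing to 40,000 exabytes, roughly doubling every 2 years) can be sanity-checked with a few lines of arithmetic:

```python
import math

# Back-of-envelope check of the IDC projection quoted in the text:
# the digital universe grows from 130 exabytes to 40,000 exabytes.
start_eb, end_eb = 130, 40_000

factor = end_eb / start_eb      # overall growth factor (~308x)
doublings = math.log2(factor)   # number of doublings that implies (~8.3)

print(f"growth factor: {factor:.0f}x")
print(f"implied doublings: {doublings:.1f}")
```

A factor of roughly 308 corresponds to about 8.3 doublings, which is broadly consistent with the "doubles every 2 years" rule of thumb over a 15-year span.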
- The spending on IT hardware, software, services, telecommunications, and staff that could be considered the “infrastructure” of the digital universe will grow by 40% between 2012 and 2020.
- The 3.9 billion humans who now use the Internet create around 2.5 quintillion bytes (2.5 × 10^18) of data every day, which would fill 10 million Blu-ray discs. These discs, stacked one on another, would measure the height of four Eiffel Towers. With a growing number of smart devices connected to the Internet of Things (IoT), that figure is set to increase significantly.
- The world’s technological capacity to store information grew from 2.6 exabytes in 1986 to 15.8 exabytes in 1993, over 54.5 exabytes in 2000, and 295 exabytes in 2007. Piling up the estimated 404 billion CD-ROMs from 2007 would create a stack reaching from the earth to the moon.
Fig. 5.3 shows how digital storage demand keeps pointing straight up; there is no escape from it. Ten years from now, CD storage will be a historical legacy.
FIGURE 5.3 This figure clearly shows how information storage has evolved from one technology to a more compact one. It started with an information explosion, which forced storage manufacturers to design more scalable products. In turn, this forced hardware manufacturers to build computers to accommodate the storage products. Windows, UNIX, and Linux designers had to retrofit their operating systems to accommodate the new storage devices.
Why is the digital universe growing so fast?

In 3000 BCE, the Sumerians in Mesopotamia were the first to develop writing. This was the first drive of human beings to store, manipulate, and communicate information. In the purview of today’s meaning of information technology (IT), we can say that IT has grown and developed through four basic ages: premechanical, mechanical, electromechanical, and electronic, where IT started as a management process of computer-based information systems. At the dawn of time, when humans exchanged a stone for wheat, they used sticks and stones to keep count. Humans have always counted on information to understand and identify, and have used various devices to store and revive this information and use it for computation. This was the beginning of IT. There was no turning back, and it has grown manifold, from databases to data storage, from data retrieval to data transmission and data manipulation. Today, IT is far beyond computers or electronic tools. It is a way of life. It is immersed in our societal fabric, and we have developed a tremendous affinity for it in our lives. It is all about communicating in real time and making sense of the data to fulfill purposes for the larger good of society. It has become an indispensable part of humankind’s lifestyle. A person who is not technology savvy is a rarity in today’s fast-paced globalized era. A person from the old Roman age would not be able to survive today, simply because he or she would not be able to communicate and would be in a total state of hysteria. The one thing that is driving IT in leaps and bounds is the dynamic state of communication. The best way to sum up the future of IT is: “Change is the only constant in this world.” The snowball of IT is rolling downhill and bringing with it disruptive change. The famous scientist Charles Darwin once said: “It is not the strongest of the species that survives, nor the most intelligent that survives.
It is the one that is most adaptable to change.” Peter Drucker, the father of modern management, reminded us: “The past is not important, the futurity of the present is.” We are under a tremendous gravitational force of IT. We will ride the wave of disruptive technologies that drive our survival, to effect new experiences and greater business opportunities. Let us list some of the driving factors that will influence the expansion of our digital universe and, at the same time, impact the capacity of our data storage. Let us not forget that by 2025 nearly 20% of the data in our digital universe will be critical to our daily lives, and nearly 10% of that will be hypercritical. Here is how eloquently the IDC white paper “Data Age 2025” described the future landscape of storage:

• Cognitive/artificial intelligence (AI) systems that change the landscape: The flood of data enables a new set of technologies, such as machine learning, natural language processing, and AI (collectively known as cognitive systems), to turn data analysis from an uncommon and retrospective practice into a proactive driver of strategic decision and action. Cognitive systems can greatly step up the frequency, flexibility, and immediacy of data analysis across a range of industries, circumstances, and applications. The IDC estimates that the amount of the global data universe subject to data analysis will grow by a factor of 50, to 5.2 zettabytes (ZB) (5.2 × 10^21 bytes), in 2025; the amount of analyzed data “touched” by cognitive systems will grow by a factor of 100, to 1.4 ZB, in 2025.
• Embedded systems and the IoT: As stand-alone analog devices give way to connected digital devices, the latter will generate vast amounts of data that will, in turn, allow us the chance to renew and improve our systems and processes in previously unimagined ways. Big Data and metadata (data about data) will eventually touch nearly every aspect of our lives, with profound consequences. By 2025, an average connected person anywhere in the world will interact with connected devices nearly 4800 times per day, basically one interaction every 18 s.
• Mobile and real-time data: Increasingly, data will need to be instantly available whenever and wherever anyone needs it. Industries around the world are undergoing “digital transformation” motivated by these requirements. By 2025, more than a quarter of the data created in the global datasphere will be real time in nature, and real-time IoT data will make up more than 95% of this.
• Cybersecurity as a critical foundation: All these data from new sources open up new vulnerabilities to private and sensitive information. There is a significant gap between the amount of data being produced today that requires security and the amount of data that is actually being secured, and this gap will widen, a harsh reality of our data-driven world. By 2025, almost 90% of all data created in the global datasphere will require some level of security, but less than half will be secured.
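Two of the IDC projections above are easy to verify with quick arithmetic: the 4800 interactions per connected person per day, and the gap between data that needs security and data that is actually secured. The 45% "secured" share below is an assumed placeholder for IDC's "less than half," used only for illustration:

```python
# Sanity-checking two IDC projections quoted in the text.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 s

# 4800 interactions/day implies one interaction every 18 seconds.
interactions_per_day = 4_800
interval = SECONDS_PER_DAY / interactions_per_day
print(f"one interaction every {interval:.0f} s")

# Share of 2025 data needing security vs. actually secured.
needs_security = 0.90  # "almost 90%" (IDC)
secured = 0.45         # assumed value for "less than half"
print(f"unsecured-but-sensitive share: {needs_security - secured:.0%}")
```

The 18-second figure falls straight out of 86,400 / 4800, so the two numbers in the text are mutually consistent.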
Data storage capacity is becoming asymptotic

Here is an interesting story about data storage: In 1985, Oracle 5.1 was delivered on 31 3½-inch floppy disks, which had become the predominant floppy format as 5¼-inch disks virtually disappeared. The advantages of the 3½-inch disk over the 5¼-inch were its higher capacity, smaller size, and rigid case, which provided better protection from dirt and other environmental risks. However, if any of the floppies were defective, we would have to restart the installation from square one. The installation stretched from 45 min to 3 h. The same agony came with UNIX and Windows NT software installations. Then the CD and Blu-ray generation came along, and installation became much more ergonomic and faster. Today, online downloading to PCs, tablets, and mobiles has become the accepted protocol for software acquisition. “Technology gives and technology takes” has become the accepted universal canonical rule for IT renovation. Fig. 5.4 shows that by 2020, storage will surface as not only a technology problem but also a grave business curse.
FIGURE 5.4 We still have a few years before we find ourselves stuck in the storage bottleneck. Even if we take the DVD with 9.2 GB of double-sided storage and pack 20 GB, it will not solve the storage crisis. Merit CyberSecurity Engineering; data courtesy of International Data Corporation, March 2017.
Dubai is the magical smart city with its Achilles’ heel

Anyone who has visited this young city can testify that it does have some magical charm. In line with its reputation for over-the-top glitz, Dubai, on receiving the award of Expo 2020, lit the world’s tallest tower with glimmering lights. The skies around the Burj Khalifa, which towers at 2717 feet, erupted with fireworks. It is not luck that has propelled Dubai to the front in terms of progress and modern city planning. The success of Dubai is primarily due to the vision and sound planning of its ruler, His Highness Sheikh Mohammed bin Rashed. If you want to market a product, go to Dubai. If you want to have a conference, set it up in Dubai. If you want to have an exhibit, then do it in Dubai. A spending spree was already underway even before officials announced the host city. Dubai estimates that a successful Expo 2020 bid will generate $23 billion between 2015 and 2021, or 24% of the city’s gross domestic product. Officials say total financing for the 6-month-long event will cost $8.4 billion. In a statement after the vote, Dubai ruler and UAE Vice President Sheikh Mohammed bin Rashed promised to “astonish the world” in 2020. “Dubai Expo 2020 will breathe new life into the ancient role of the Middle East as a melting pot for cultures and creativity,” he said. The announcement, made in 2013, came just days before the December 2 UAE national day, which celebrates the young nation’s 42 years of unity and independence from the British Protectorate. Huge plans are underway: 140 more hotels are expected to open in Dubai before the Expo starts (around 460 now, serving 118,000 rooms and 10 million tourists). Jobs and vacancies are advertised even on Emirates flights; the airline is also preparing to increase its fleet and routes, and a new airport has already been built, though it is not yet operational. Dubai is impressive with its unique building infrastructures. Let us describe a few:

• Burj Al Arab
• The Palm Islands: Jumeirah Palm Island
• Burj Khalifa
• The Atlantis Palm Hotel and Jumeirah Beach Residence
• Dubai Marina and Dubai Aquarium
• Infinity Tower and The World Islands
• Dubai Mall and Dubai Fountain
• Dubai Electricity and Water Authority, where electric power is drawn from desalination technology
It is the city of wonders, a paradise for tourists, with Lamborghinis for police patrol. According to Wikipedia, “Dubai is thought to have been established as a fishing village in the early 18th century and was, by 1822, a town of some 700-800 members of the Bani Yas tribe.” In July 2019, the population of the city of Dubai reached 3.331 million. On December 2, 1971, Dubai, together with Abu Dhabi, Sharjah, Ajman, Umm al-Quwain, and Fujairah, formed a union that later became the United Arab Emirates (UAE). The UAE has a Minister of Happiness, a Minister of Tolerance, a Minister for Youth, and a Minister of Artificial Intelligence. Dubai is accelerating on the leading edge of technology and leading the world with its glamor and the vision of its leadership. We were able to get some interesting data, shown in Fig. 5.5, from the Dubai Chamber of Commerce, which has ministry-level status. The statistics show the dynamics of the city and where it is heading with the deluge of generated data.
FIGURE 5.5 The Dubai Chamber of Commerce maintains all the up-to-date operating data for the city. The explosive growth of business is manifested in the diagram. In 2017, the city of Dubai had 51,312 registered public and private companies.
Dubai embracing DNA storage

With its incredible appetite for information, Dubai has the proper vision to migrate to DNA storage. The University of Dubai, in collaboration with the Dubai Electronic Security Center (DESC), is looking into DNA digital storage as the new storage alternative. A prototype of DNA storage will be built and demonstrated at Expo 2020. We have two interesting graphs that show projections of information usage, conventional storage, and potential DNA storage. In the first graph, shown in Fig. 5.6, we have four indicators worth explaining:

• Dubai is very rich in critical systems, which are increasing by 14% annually.
• Dubai’s population is increasing by 11% annually, including new company employees and expatriates.
• Dubai heavily uses IoT sensors for traffic and crime control.
• The demand for storage of operating data is steadily increasing by 24% annually.
FIGURE 5.6 The graph shows two critical indicators that are closely correlated: the more users there are, the more storage is required.
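The annual growth rates listed above imply concrete doubling times. A quick sketch, using the standard compound-growth formula (the rates themselves are taken from the bullet list):

```python
import math

# Doubling time implied by a constant annual growth rate:
# (1 + r)^t = 2  =>  t = ln(2) / ln(1 + r)
def years_to_double(annual_rate: float) -> float:
    return math.log(2) / math.log(1 + annual_rate)

for name, rate in [("critical systems (14%/yr)", 0.14),
                   ("population (11%/yr)", 0.11),
                   ("storage demand (24%/yr)", 0.24)]:
    print(f"{name}: doubles every {years_to_double(rate):.1f} years")
```

At 24% annual growth, storage demand doubles roughly every 3.2 years, which is why the chapter treats it as the most urgent of the three trends.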
In the second graph, shown in Fig. 5.7, we have the following indicators:
1. Dubai’s demographics are a crucial metric of its popularity, although the cost of living is skyrocketing.
2. Dubai’s demand for data storage is trending in the wrong direction. Dubai consumes an astronomical amount of magnetic storage, which has become a major concern for the city.
3. Storage criticality is increasing by 37% every year and is getting out of control. It needs a better technological solution, such as DNA.
4. Dubai’s digital storage requirement is ballooning, and it has reached the critical-mass level.
FIGURE 5.7 The demand for storage is increasing rapidly, while the DNA storage curve is moving slowly compared with the conventional storage demand.
The hyper data center of the world

Here are some figures and analogies to give you a better picture of the necessary size of such a storage facility: the size of a large data center is 7.2 million ft², which is comparable with 126 football
fields, as shown in Fig. 5.8. A large data center usually consumes 650 MW of power. If we assume 1 MW can power 1000 homes, then a city of 80,000 people would use about 45 MW of power. This data center, then, has the power to supply a city of 500,000 people.
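A quick back-of-envelope check of the power figures above, under the chapter's own assumption that 1 MW powers 1000 homes:

```python
# Power math for the hyperscale data center described in the text,
# using the chapter's stated assumption of 1 MW per 1000 homes.
DATA_CENTER_MW = 650
HOMES_PER_MW = 1_000

homes_equivalent = DATA_CENTER_MW * HOMES_PER_MW
print(f"a {DATA_CENTER_MW} MW data center draws as much as "
      f"{homes_equivalent:,} homes")
```

By the same assumption, the 45 MW cited for a city corresponds to 45,000 homes, which gives a feel for how enormous a single 650 MW facility really is.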
FIGURE 5.8 There will be more than 390 hyperscale data centers in the world by the end of the year. This monstrous hyperscale data center is as long as a large runway. We can pack 2 million servers and 6 million storage racks under this roof, burning enough electricity to serve a city of 1 million people. We need to think harder! https://www.datacenterdynamics.com/news/synergy-number-hyperscale-data-centers-reached-430-2018/.
Five types of data centers

According to the December 2012 IDC report, by 2017 the world would have an estimated 8.4 million data centers (3 million in the United States alone), characterized by massive servers and a gargantuan arsenal of RAID storage racks. We classify data centers into five categories, based on size and scalability:
1. Enterprise data center: This is the one that most of us are familiar with. It is owned and built by the end user to house its own data center. Many companies still have their data center on the corporate campus, and a handful of companies are building one for their own use. This is not the most efficient use of precious capital. Included in this category are Internet service providers such as Earthlink and Network Solutions.
2. Managed service providers (facility management): This was the earliest form of outsourcing, which occurred when companies hired AT&T, IBM, and HP. A good example is EDS, where the provider manages the data center for the client. The facility management type of arrangement is no longer popular.
Chapter 5 The digital universe with DNA
3. Colocation data center (leasing): In the "old days," this was called a service bureau, like the UCC. Companies lease the data center for use as a platform as a service (PaaS) or software as a service (SaaS).
4. Social media data center: This is used for retail and social networking companies such as Facebook, Twitter, Google, Apple, and Yahoo.
5. Cloud data center: This is the most popular type for business and enterprise computing. Popular providers include Amazon (AWS), IBM (SoftLayer), and Microsoft (Azure). Many of us use the cloud on a regular basis and do not even realize it.
The top 10 hyperscale data centers
On February 16, 2017, Rack Solutions published a report titled "The Top 10 Largest Data Centers in the World." The list looks at the 10 largest data centers that are either active or being built today, ranked by total square footage of data center floor space. In fact, we can say that in a few years, hyperscale will no longer be a glamorous term.
1. The Citadel, Tahoe Reno, Nevada: 7.2 million ft²
2. Range International Information Group, Langfang, China: 6.3 million ft²
3. Switch SUPERNAP data center, Las Vegas, Nevada: 3.5 million ft²
4. DFT (DuPont Fabros Technology) data center, Ashburn, Virginia: 1.6 million ft²
5. Utah data center (National Security Agency), Bluffdale, Utah: 1.5 million ft²
6. Microsoft data center, West Des Moines, Iowa: 1.2 million ft²
7. Lakeside Technology Center, Chicago, Illinois: 1.1 million ft²
8. Tulip data center, Bangalore, India: 1 million ft²
9. QTS Metro data center, Atlanta, Georgia: 990,000 ft²
10. Next Generation Data Europe, Wales, UK: 750,000 ft²
Each one of these huge data centers employs over 5000 people to support 1 million transactions per minute running on 1 million servers. On average, such a center has over 1.5 million concurrent users. The panoramic view in Fig. 5.8 gives us an idea of the enormity of the structure needed to accommodate the exponential workload of resource-intensive applications. Hyper data centers coexist with the exponential growth of users; communication, education, social media, and business-critical systems all depend on them:
• Facebook has 1 billion monthly active users.
• Yahoo has more than 500 million users worldwide.
• Twitter has more than 500 million active users around the world.
• Google users conduct over 4.72 billion searches per day.
• Google has more than 400 million active users.
• Apple iCloud has 190 million users.
By 2055, the digital universe will grow to 161 zettabytes (161 × 10^21 bytes), and DNA-based data storage will become a reality. Smart cities are not smart unless they break away from traditional security software and take advantage of AI and nanotechnology. DNA-based data storage is the only viable technical solution that can accommodate this hyper data explosion. Dubai, as a leading smart
city, will pioneer this new solution to data growth. The Digital Immunity Ecosystem (DIE) is a very challenging "molecular" solution that could start in Dubai and trickle out into the rest of the world. Since the dawn of modern computers, the rapid digitization and growth in the amount of data created, shared, and consumed have transformed society greatly. In a world that is interconnected, change happens at a startling pace. Have you ever wondered how this connected world of ours got connected in the first place? The technology is expensive for now and is still in development; however, it has the potential to be the future of data storage. In the future, we will be able to use DNA data storage drives, and corporations such as Facebook, Google, and the like will have DNA data storage data centers.
How did DNA digital storage start?
"Nature does it better" is an aphorism that scientists and engineers have confronted in its various embodiments for years, from the invention of Velcro, inspired by the grabbing mechanism of burrs, to the use of aquaporin proteins to desalinate seawater. Now, the biotech community is recognizing that nature does one more thing better than human innovation has thus far been able to: data storage. The story of DNA digital storage is quite interesting. We know that DNA molecules store the genetic blueprint for living cells and organisms, but no one really thought that DNA could store binary numbers transcribed in a special DNA scheme until 1964, when the Soviet physicist Mikhail Neiman published his work in the journal Radiotekhnika. The paper discussed the possibility of recording, storing, and retrieving digital information on DNA molecules. He was right on the money!
Today, the predominant method of storing digital data takes the form of binary code made of strings of 1s and 0s that can be interpreted by our computers. Living cells, on the other hand, store their genetic data in strings of As, Ts, Gs, and Cs, which represent the quaternary (base-4) code of DNA. As we know, adenine (A), thymine (T), cytosine (C), and guanine (G) are the four building blocks of DNA, so these blocks can also be used to store and retrieve digital information in DNA using base 4 instead of base 2. In hard drives, we use 0s and 1s to represent data, whereas in DNA, we have A, T, C, and G for data representation. These two systems are similar enough that, with a little bit of ciphering (for example, making 00, 01, 10, and 11 translate to A, T, G, and C, respectively), scientists are now accessing an entirely novel physical storage medium unlike any we have ever harnessed before. DNA storage has a high data density, is stable for thousands of years under low-light, low-temperature, and low-moisture conditions, and represents something of a universal language for all life on earth. Teams around the world have developed multiple DNA-binary ciphering systems varying in their complexity and are beginning to explore encryption methods and built-in redundancy as features of their coded data.
• On August 16, 2012, Science magazine published a paper by Dr. George McDonald Church, an American geneticist at Harvard University. He was able to encode digital information in DNA, including a 53,400-word book he wrote, along with a variety of images and programs. He stored 5.5 petabits (5.5 × 10^15 bits) in one cubic millimeter of DNA, with the data bits mapped one-to-one to DNA bases. His result showed that DNA can serve as another type of storage medium, like hard drives and magnetic tapes.
• In January 2013, Nature magazine published a research article written by researchers from the European Bioinformatics Institute (EBI). Over 5 million bits of data were stored, retrieved, and reproduced. All the DNA files reproduced the information with between 99.99% and 100% accuracy.
• In February 2015, researchers from ETH Zurich successfully encoded data in DNA for long-term stability; the team added redundancy via Reed-Solomon error correction coding and encapsulated the DNA within silica glass spheres via sol-gel chemistry.
• In 2016, research by Harvard professor Church and the Technicolor Research and Innovation Lab was published in which 22 megabytes of an MPEG-compressed movie sequence were stored in and recovered from DNA (www.wikivisually.com/wiki/DNA_digital_data_storage).
• In March 2017, Yaniv Erlich and Dina Zielinski of Columbia University, in collaboration with the New York Genome Center, published work on a method known as DNA Fountain, which stored data at a density of 215 petabytes (215 × 10^15 bytes) per 1 g of DNA. Synthetic DNA could be the answer to the world's accelerating data storage needs, and now researchers have shown that it can achieve a much higher density than previously demonstrated (https://www.eurekalert.org/pub_releases/2017-03/cuso-rsc022417.php).
• In March 2018, the University of Washington and Microsoft announced the results of a study in a Nature Biotechnology article titled "Random Access in Large-Scale DNA Data Storage." The study demonstrated storage and retrieval of approximately 200 megabytes of data. The research also
proposed and evaluated a method for random access to data items stored in DNA (https://en.wikipedia.org/wiki/DNA_digital_data_storage).
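The simple two-bit cipher mentioned earlier, where 00, 01, 10, and 11 translate to A, T, G, and C, can be sketched in a few lines of Python. This is a minimal illustration of the idea only; the function names are ours, not from any published tool.

```python
# Two-bit cipher from the text: 00->A, 01->T, 10->G, 11->C.
BIT_PAIR_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_BIT_PAIR = {base: pair for pair, base in BIT_PAIR_TO_BASE.items()}

def bits_to_dna(bits: str) -> str:
    """Encode an even-length bit string as a DNA base sequence."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(dna: str) -> str:
    """Decode a DNA base sequence back into the original bit string."""
    return "".join(BASE_TO_BIT_PAIR[base] for base in dna)

data = "0100100001101001"   # the 16 ASCII bits of the text "Hi"
strand = bits_to_dna(data)
print(strand)               # TAGATGGT
assert dna_to_bits(strand) == data  # the round trip is lossless
```

The lossless round trip is the essential property any DNA ciphering scheme must preserve; the published systems differ mainly in how they add error correction on top of such a mapping.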
Another interesting story worth mentioning: On January 21, 2015, Dr. Nick Goldman of the European Bioinformatics Institute (EBI) announced the DNA Storage Bitcoin challenge at the World Economic Forum annual meeting in Davos, Switzerland. During his presentation, Dr. Goldman handed out DNA tubes to the audience, telling them that each tube contained the private key of exactly one bitcoin, all coded in DNA. The first one to sequence and decode the DNA bitcoin key could claim the bitcoin and win the challenge. The challenge was set to run for 3 years, until January 21, 2018. Almost 3 years later, on January 19, 2018, the EBI announced that a Belgian PhD student, Sander Wuyts of the University of Antwerp and Vrije Universiteit Brussel, was the first to complete the challenge. He mastered the method, decoded the private key, and took possession of the bitcoin, whose value on January 19, 2018, was around $10,666.96. The moral of the story: decryption of the information encoded in the DNA sample demonstrates that this digital storage method is practical.
From DNA genetic code to DNA binary code

Total size of the human genes in bytes
Most DNA molecules are double-stranded helices. Each molecule consists of two long biopolymers made of simpler units called nucleotides; each nucleotide (building block) contains a base, recorded using the letters G, A, T, and C. DNA is well suited for biological information storage. Human DNA has approximately 3 billion base pairs, according to the National Human Genome Research Institute (NHGRI), which means 4^3,000,000,000 possible base sequences. Each base pair can encode 2 bits (one of four bases), so the genome holds about 6 billion bits. In addition, human development follows instructions from a regulator called the epigenome, which has around 2^22,500 coding choices for the genes, or 22,500 binary bits. In the end, we come up with a total of 6,000,022,500 information bits, or approximately 6 Gb (gigabits). If we count in 7-bit ASCII bytes rather than bits, 6 gigabits amount to 6/7 ≈ 0.857 GB (gigabytes), or 857 MB (megabytes). So the human DNA data are equivalent to about 857 MB of digital data. Amazing!
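The arithmetic above is easy to verify with a few lines of Python. All constants come from the text (3 billion base pairs, 22,500 epigenome bits, 7-bit ASCII bytes); nothing here is measured data.

```python
# Back-of-the-envelope check of the human-genome byte count.
BASE_PAIRS = 3_000_000_000
GENOME_BITS = 2 * BASE_PAIRS        # 2 bits per base pair (4 possible bases)
EPIGENOME_BITS = 22_500             # log2 of the ~2^22,500 epigenome choices

total_bits = GENOME_BITS + EPIGENOME_BITS
gigabytes = total_bits / 7 / 1e9    # 7-bit ASCII bytes, as in the text

print(total_bits)                   # 6000022500
print(round(gigabytes, 3))          # 0.857
```

Using the more common 8-bit byte instead of a 7-bit ASCII byte would give 0.75 GB; the text's 857 MB figure follows its own 7-bit convention.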
Why ASCII was so important to DNA coding: A glimpse of history
The man who invented the Esperanto of the technology world, enabling computers to swap information freely, is Bob Bemer of IBM, who developed the ASCII coding system to standardize the way computers represent letters, numbers, punctuation marks, and some control codes. Before 1963, computer manufacturers had over 60 different ways of representing characters, and machines could not communicate with one another. This problem was becoming increasingly evident as companies like IBM began networking computers. That year, ASCII (pronounced "AS-KEE"), the American Standard Code for Information Interchange, shown in Fig. 5.9, was released to serve as a common language among computers. The idea was that 128 characters, covering letters, numbers, punctuation marks, and control codes, would each have a standard numeric value.
FIGURE 5.9 The complete table of the American Standard Code for Information Interchange (ASCII), the character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices, and most modern character-encoding schemes are based on ASCII. Octal and hexadecimal codes will not be used during digital encoding in DNA, only ASCII; the only columns of interest are the first (decimal), the second (binary), and the last column.
Among his important contributions was the "escape" sequence. Committee members, working within the limits of seven-bit hardware, could only create 128 characters. Understanding that this was not enough for a global system, Bemer developed a method allowing computers to switch from one alphabet to another. More than 150 "extra-ASCII" alphabets have been created since 1963. In 1968, President Lyndon B. Johnson signed a memorandum adopting ASCII as the standard communication language for federal computers. ASCII became ubiquitous with the spread of the Internet, as it was the basis for characters in email messages and HTML documents. It was present in hardware and most computer operating systems, although Windows moved away from ASCII with the release of its NT operating system in the late 1990s, which used the Unicode standard.
Luckily for DNA storage, ASCII was adopted as the conversion standard. Once we start editing DNA, we will need to keep track of what we do, including revision histories, comments on new genes, and copyright notices. ASCII standards were suggested for entering such information into the genome, with base-4 codons used to encode 7-bit ASCII. First, we use the ASCII table to represent letters as numbers. The acronym VVIT breaks down into V = 86; V = 86; I = 73; T = 84.
Next, let us assign binary numbers to the DNA bases A, T, C, and G:
A = 00; T = 01; C = 10; G = 11 (in binary), which is equivalent to A = 0; T = 1; C = 2; G = 3 (in decimal)
Now, let us convert the ASCII values to quaternary (base-4) numbers:
86 = 1112; 86 = 1112; 73 = 1021; 84 = 1110
Then VVIT = 1112 1112 1021 1110, which in DNA code is: TTTC TTTC TACT TTTA
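The VVIT walk-through above can be automated as a short sketch. The mapping A = 0, T = 1, C = 2, G = 3 is the one used in the text; the function names are illustrative only.

```python
# ASCII -> base-4 -> DNA transcription, as walked through in the text.
DIGIT_TO_BASE = "ATCG"  # index 0 -> A, 1 -> T, 2 -> C, 3 -> G

def char_to_dna(ch: str) -> str:
    """Convert one ASCII character to four DNA bases via base 4."""
    value = ord(ch)                 # e.g., 'V' -> 86
    digits = []
    for _ in range(4):              # four base-4 digits cover 0..255
        digits.append(value % 4)
        value //= 4
    return "".join(DIGIT_TO_BASE[d] for d in reversed(digits))

def text_to_dna(text: str) -> str:
    return " ".join(char_to_dna(ch) for ch in text)

print(text_to_dna("VVIT"))  # TTTC TTTC TACT TTTA
```

Running it reproduces the hand calculation: 86 is 1112 in base 4, which maps to TTTC, and so on for each letter.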
FIGURE 5.10 The table shows, step by step, the transcription from plain text to DNA code. The process of storing binary information in DNA requires two stages: synthesis and sequencing. The table shows how we converted the word VVIT into the DNA code TTTC TTTC TACT TTTA.
Let us represent the conversion graphically, as shown in Fig. 5.10, to make it more convincing. To convert digital data to DNA, we need some basic rules. Digital data means normal numbers from 0 to 9; we call them numbers of base 10. Numbers in binary code are represented with 0s and 1s; we call them numbers of base 2. The conversion is automatic inside the computer. In binary-coded decimal (BCD), each decimal digit is replaced by its 4-bit binary equivalent:
Decimal: 0    1    2    3    4    5    6    7    8    9
Binary:  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001
So, for example, using this table, decimal numbers are converted to BCD as follows:
91₁₀ = 1001 0001, 32₁₀ = 0011 0010, and 357₁₀ = 0011 0101 0111
The conversion from binary-coded decimal (BCD) to decimal is the exact opposite of the above. Simply divide the binary number into groups of four digits, starting with the least significant digit, and then write the decimal digit represented by each 4-bit group:
1001₂ = 1001 BCD = 9₁₀
1010₂ = this will produce an error, as it represents decimal 10, which is not a valid BCD digit
1000111₂ = 0100 0111 BCD = 47₁₀
010100111000.011000100101₂ = 0101 0011 1000.0110 0010 0101 BCD = 538.625₁₀
A team of researchers in the United States has successfully encoded a 5.27-megabit book using DNA microchips, and they then read the book using DNA sequencing. Their experiments show that DNA can be used for long-term storage of digital information. George Church and Sriram Kosuri, of Harvard's Wyss Institute for Biologically Inspired Engineering, and colleagues encoded Church's book Regenesis. The book was around 53,400 words (5.27 megabits), which they converted into DNA sequences, along with 11 images in JPG format and a JavaScript program. This is 1000 times more data than had been encoded in DNA previously.
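The BCD rules above can be sketched as a small codec. The helper names are ours; the digit-by-digit grouping is exactly the rule described in the text.

```python
# Binary-coded decimal (BCD): one 4-bit group per decimal digit.
def decimal_to_bcd(number: int) -> str:
    """Encode a non-negative integer as space-separated 4-bit BCD groups."""
    return " ".join(format(int(digit), "04b") for digit in str(number))

def bcd_to_decimal(bits: str) -> int:
    """Decode BCD (spaces optional); rejects 4-bit groups above 1001 (9)."""
    bits = bits.replace(" ", "")
    bits = bits.zfill((len(bits) + 3) // 4 * 4)  # pad on the left to full groups
    digits = []
    for i in range(0, len(bits), 4):
        value = int(bits[i:i + 4], 2)
        if value > 9:
            raise ValueError(f"{bits[i:i+4]} is not a valid BCD digit")
        digits.append(str(value))
    return int("".join(digits))

print(decimal_to_bcd(357))        # 0011 0101 0111
print(bcd_to_decimal("1000111"))  # 47 (pads to 0100 0111)
```

Note that `bcd_to_decimal("1010")` raises an error, matching the rule that 1010 (decimal 10) is not a valid BCD digit.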
The book's program and images were converted to HTML and translated into a sequence of 5.27 million binary 0s and 1s. The 5.27 megabits were then divided into data blocks 96 bits long, each written as a string of DNA building blocks called nucleotides. The A and C bases were encoded as 0s, whereas the G and T bases were encoded as 1s. Each block has a 19-bit address to assign its place in the overall sequence, and multiple copies of each block were synthesized to help with error correction. After the book and other information were encoded into the DNA, drops of DNA were attached to microarray chips for storage. The chips were kept at 4°C for 3 months, ready for retrieval. However, Church's storage method had error problems. A new storage method was developed by Dr. Nick Goldman of the European Bioinformatics Institute (EBI) at Hinxton, UK. It marks another step toward using nucleic acids as a practical way of storing information, one that is more compact and durable than current media such as hard disks or magnetic tape. The method produced a truly concise anthology of verse by encoding all 154 of Shakespeare's sonnets in DNA.
Along with the sonnets, the team encoded a 26-second audio clip from Martin Luther King's famous "I Have a Dream" speech, a copy of James Watson and Francis Crick's classic paper on the structure of DNA, a photo of the researchers' institute, and a file that describes how the data were converted. Goldman's approach was to encode 5.2 million bits of information in DNA, roughly the same amount as Church did, but he used a simple code in which the DNA bases adenine (A) or cytosine (C) represented 0, and guanine (G) or thymine (T) represented 1. This sometimes led to long stretches of the same letter, which is hard for sequencing machines to read and leads to errors. Fig. 5.11 shows the step-by-step conversion from text to DNA code.
FIGURE 5.11 We took a simple phrase, "File code73xb." We first converted it into binary code using the ASCII table (the second row) and then converted the message into Huffman base-3 code. Each base-3 digit is encoded as one of the DNA bases A, T, G, or C, depending on the letter in the strand that came before; for example, a 0 will be encoded as G if the previous base was C. This rotation complicates the encoding process, but it prevents creating strands with several repetitions of the same base, which can cause errors when sequencing the strand later. To recover the original text, the process is run in reverse.
Goldman's group developed a more complex cipher (secret code) made up of five-letter words combining A, C, G, and T. To limit errors further, the team broke the DNA code into overlapping strings, each 117 letters long, with indexing information to show where each belongs in the overall code. The system encodes the data in four partially overlapping strings, in such a way that any errors in one string can be cross-checked against the three other strings. Agilent Technologies in Santa Clara, California, synthesized the strings and shipped them back to the researchers, who were able to reconstruct all of the files with 100% accuracy. Most engineers are familiar with basic methods for verifying the integrity of data transmitted over noisy communication channels; simple checksums and cyclic redundancy checks (CRCs) are the most popular. A related coding technique, named after its discoverer, Marcel J. E. Golay, goes further: errors can not only be detected but also removed from received data without retransmission, enhancing the reliability of communication on a corrupted data link. Encoding data into a single long strand of DNA is asking for trouble when it comes time to recover the data. A safer process encodes the data in shorter strands, constructing the first part of each new strand from the same data found at the end of the previous strand. This way we have multiple copies of the data for comparison. Fig. 5.12 shows Goldman's stepwise encoding.
FIGURE 5.12 Stepwise encoding of data into DNA using Goldman's approach. Binary data are converted to base-3 Huffman code, which is then converted to DNA sequences. Each DNA sequence is broken into overlapping fragments, with 75 bases shared between consecutive fragments and alternate fragments reverse-complemented.
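The rotation rule shown in Figs. 5.11 and 5.12, where each base-3 digit becomes one of the three bases that differ from the previous base, can be sketched as follows. This is a minimal illustration of the idea only; the particular rotation table and the seed base are assumptions, and Goldman's published implementation differs in detail.

```python
# Goldman-style rotating base-3 encoder: each trit is written as one of
# the three bases different from the previous base, so the strand never
# repeats a base (which would be hard to sequence reliably).
BASES = "ACGT"

def trits_to_dna(trits, previous="A"):
    """Encode a sequence of base-3 digits (0, 1, 2), avoiding repeats."""
    strand = []
    for trit in trits:
        # the three candidate bases, in a fixed order, excluding `previous`
        choices = [b for b in BASES if b != previous]
        previous = choices[trit]
        strand.append(previous)
    return "".join(strand)

strand = trits_to_dna([0, 0, 0, 1, 2, 0, 1])
print(strand)
# even a run of identical trits never yields two identical adjacent bases:
assert all(a != b for a, b in zip(strand, strand[1:]))
```

The key property is visible in the assertion: a run of identical input digits still produces an alternating strand, which is exactly why Goldman chose a rotating code over a fixed digit-to-base table.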
The Huffman compression rule
In ASCII, every character is encoded with the same number of bits: 8 bits per character. Since there are 256 different values that can be encoded with 8 bits, there are potentially 256 different characters in the ASCII character set (note that 2^8 = 256). The common characters, e.g., alphanumeric characters, punctuation, control characters, and the like, use only 7 bits; there are 128 different characters that can be encoded with 7 bits (note that 2^7 = 128). Huffman coding compresses data by using fewer bits to encode more frequently occurring characters, so that not all characters are encoded with 8 bits. Huffman coding is a lossless compression algorithm that we use in digital-to-DNA coding; it allows the original digital data to be perfectly reconstructed from the compressed data.
Example: the gopher message
Here is a simple coding example. We have the message "go go gophers," encoded in ASCII. We will demonstrate how, with Huffman coding, we can compress the message into a simpler coding scheme, saving time and space without losing any characters of the original message. Fig. 5.13 shows 104 bits compressed into 39 bits. With an ASCII encoding (8 bits per character), the 13 characters of the message "go go gophers" require 104 bits. The table on the left shows how the coding works. By using a 3-bit binary code per character, the message "go go gophers" uses a total of 39 bits instead of 104 bits. More bits can be saved if we use fewer than 3 bits to encode characters such as g, o, and
FIGURE 5.13 The message “go go gophers” can be encoded as 3-bit binary code from the right table.
space, which occur frequently, and more than 3 bits to encode characters such as e, p, h, r, and s, which occur less frequently in "go go gophers." This is the basic idea behind Huffman coding: use fewer bits for more frequently occurring characters.
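The idea can be demonstrated with a compact Huffman coder built from the Python standard library. This sketch builds an optimal code for the gopher message and counts the encoded bits; the 37-bit total beats both the 104-bit ASCII encoding and the 39-bit fixed-width code discussed above.

```python
# Minimal Huffman coder for the "go go gophers" example.
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a Huffman code table {character: bit string}."""
    # heap entries: (frequency, tie-breaker, {char: partial code})
    heap = [(freq, i, {ch: ""})
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

message = "go go gophers"
table = huffman_code(message)
encoded_bits = sum(len(table[ch]) for ch in message)
print(encoded_bits)  # 37 -- versus 39 bits fixed-width and 104 bits ASCII
```

Frequent characters (g, o, space) get 2-bit codes while the rare ones get 4-bit codes, and because no code is a prefix of another, the bit stream decodes unambiguously.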
Summary of the data storage in DNA
The theoretical storage density of DNA is 2 bits of data per nucleotide (i.e., ~14 atoms per bit), far greater than any traditionally used storage medium. To date, the best storage density in DNA was achieved by Yaniv Erlich's team at Columbia in 2017, at 1.6 bits of data per nucleotide (80% of the
theoretical limit). Erlich's DNA Fountain technique, based on recent developments in computer coding theory, is capable of packing up to 215 petabytes (215 × 10^15 bytes) of data into a single gram of DNA, a 10-fold improvement over previous DNA storage methods. When scientists built a computer running on DNA in Israel in 2003, it contained none of the silicon, metals, or rare earths used in our devices today. It could also perform 330 trillion operations per second, a staggering 100,000 times faster than silicon-based personal computers. A DNA computer would be much "greener" and more in keeping with our 21st-century ideas of sustainability and reducing the carbon footprint. DNA computers do not need much energy to work; it is just a case of putting DNA molecules into the right chemical soup and controlling what happens next. If built correctly, and that is where the technical challenge lies, a DNA computer will sustain itself on less than one-millionth of the energy used in silicon chip technology. The reason is simple: DNA is a very dense, highly coiled molecule that can be packed tightly into a small space. It lives in nature inside tiny cells that are only visible under a microscope, yet the DNA from one cell would stretch to 2 m long if uncoiled and pulled straight. The information stored in DNA can also be kept safely for a long time. We know this because DNA from extinct creatures, like the mammoth, has lasted 60,000 years or more when preserved in ice, in dark, cold, and dry conditions.
The CRISPR magic
Dr. George Church, behind the Personal Genome Project, once said, "This brings us to the sixth industrial revolution. Like others, this revolution has crucial quantitative measures (probability, information, and complexity) as well as possibly crucial emerging measures of life, evolution, and intelligence." Researchers have harnessed bacterial enzymes that allow them to snip or bind DNA strands anywhere they want. So this is as good a time as any to get acquainted with the powerful new gene-editing technology known as CRISPR. This gene-editing technology is the biggest game changer to hit biology since the polymerase chain reaction (PCR), the DNA sequence amplifier. Dr. Jennifer Doudna, a professor of chemistry and of molecular and cell biology at the University of California, Berkeley, was largely responsible for unleashing the new technique on the world. She said, "CRISPR enables the sorts of modifications that in the past have really been a dream." Let me explain the meaning of a miracle. Religions interpret a miracle as divine intervention, but this is not what we are interested in. In science, we do not have miracles; the term is drastically abused and twisted. We do have unexpected surprises due to unearthed, unscheduled events, which will be scientifically explained later. A scientific miracle is a stepping-stone for further miracles, which will enhance the human endeavor for further progress; it is a means and not an end. Having said that, we can say that CRISPR is a scientific miracle with great momentum for helping humanity. Fixing DNA defects and eliminating complex disorders such as autism, Alzheimer's disease, and diabetes would be an incredible achievement, comparable with the discovery of vaccines. Jerry Lewis will be crying in his grave from joy when he finds out that the children with muscular dystrophy he championed will be able to walk in 5 years. CRISPR is truly a Holy Grail. We give credit to two dedicated scientists, Dr.
Jennifer Doudna and Dr. Emmanuelle Charpentier, who were able to develop a new technology called CRISPR-Cas9. We can say in one respect that it is a
technology that defies the course of nature for a better future for "unhappy" people who were born with defects. As eloquently described by Dr. George Church, "This brings us to the sixth industrial revolution: the information genomic revolution." Fig. 5.14 shows the miracle of gene editing in four steps. CRISPR-Cas9's power and versatility have opened up new and wide-ranging uses across biology, including medicine and agriculture. The foundational research of the Doudna/Charpentier research team enabled subsequent work by many laboratories throughout the world that used CRISPR-Cas9 to treat and cure diseases in animal models and to create pathways to sustainable biofuels, more robust crops, and countless other applications that will continue to dramatically advance human health and well-being (for example, therapies for sickle cell disease).
FIGURE 5.14 Take, for instance, the diagram of a human cell. While these cells do not have CRISPRs, researchers can essentially deliver a customized CRISPR into the cell (1) to direct the Cas-9 enzymes to carry out a specific DNA-editing (2) task. “We can put in artificial DNA sequences that will tell the Cas-9 to go to a mutated gene that causes Huntington’s disease or cystic fibrosis, and we can direct that Cas-9 enzyme to correct (3) or erase that mutation (4) that causes the disease and make it the normal healthy sequence,” Knoepfler added. Designed by MERIT CyberSecurity Engineering.
How to hack DNA
Some bacteria have evolved a powerful system, called CRISPR, to defend themselves against viral infections, as demonstrated in Fig. 5.15. When a virus strikes, the bacteria copy and store a short, identifying sequence of the virus's DNA, a sort of genetic "memory card." If the same virus attacks
FIGURE 5.15 Scientists can begin to understand gene function by turning a gene on and off. To do this, they program CRISPR-Cas9 structures in a lab to snip DNA and disable genes that affect health and crops. Synthetic DNA sequences can also be engineered in the lab and sliced in at the site of the cut, introducing desired traits into an organism, such as resistance to a parasite. With CRISPR, scientists can alter and edit any genome that has been sequenced, quickly, cheaply, and efficiently. Designed by MERIT CyberSecurity engineering.
future generations of the bacteria, they use the memory card to guide a killer enzyme to the identical sequence in the new invader and cut it away. Scientists have co-opted this natural molecular machinery not only to turn off the action of a gene but also to insert new genetic code into living organisms, including humans. CRISPR has sparked an explosion of research and a heated ethical debate. So what is so magical about CRISPR-Cas9? In the Bible, John 5:8 reads, "...then Jesus said to the crippled man, 'Get up! Pick up your mat and walk.'" Now, with CRISPR, we can do the same. All our body's "natal" defects are pathologically connected to our DNA. We can go to the source, repair some defects, and remove mutations as well.
Here is a startling revelation that we should keep in mind. Scientific American magazine, in its November 30, 2007 issue, noted, "There are 10 times more cells from viruses in and on our bodies than there are human cells. There have been billions of tiny elephants in the room: viruses." Simply put, our body is almost like a Russian nesting doll (matryoshka): the virus lives inside the bacterium, which lives inside our gut.
Anatomy of CRISPR: the smart cleaver
A word on word processing
Our world depends on many vital tools too numerous to list here, but one of them is word processing. In fact, the term should be renamed "document builder," because a word processor builds a document and makes any changes to it: we can replace a word or a paragraph, delete a term, or relocate it elsewhere. In the old days, we used a stripped-down typewriter, which was the darling of the office, and if the boss wanted a few revisions to the document, the secretary would gladly retype the entire thing. Then Liquid Paper correction fluid was born, a giant leap in document editing: no document was immune from this magic liquid, and no business could survive without it. There is no trace of the very first alleged word processors. The oldest record we have is from 1714, when an English engineer named Henry Mill got a patent for a machine capable of writing so clearly
and accurately you could not distinguish it from a printing press. In 1867, Christopher Latham Sholes created the first "typewriter" that, although huge, everyone would identify at first sight. In his time, the prestigious Scientific American magazine described it as a "literary piano" (https://archive.org/stream/Mr.Typewriter-ABiographyOfChristopherLathamSholes/Mr.%20Typewriter-%20A%20biography%20of%20Christopher%20Latham%20Sholes_djvu.txt). Fig. 5.16 shows the evolution of the typewriter and word processing. We all remember the white correction liquid. Thomas Edison patented the electric typewriter in 1872, but the first workable model was not introduced until the 1920s. In the 1930s, IBM introduced a more refined version, the IBM Electromatic, which "greatly increased typing speeds and quickly gained wide acceptance in the business community." In October 1983, Microsoft Word was first released under the name Multi-Tool Word for Xenix systems. Earlier, in May 1973, Wang Laboratories had presented the Wang 2200 in the United States and Great Britain. The Wang 2200 was a good machine with a large client base; at its peak in the 1980s, Wang Labs had revenues of $3 billion a year and employed over 33,000 people. An Wang, CEO of Wang Laboratories, competed head-on with IBM but could not hold a candle to the company that monopolized the world's computing industry. Then the world moved to WordPerfect, and the fast emergence of MS Office was a great strategic move. We can deduce that the failure of WordPerfect was due to two critical reasons: first, WordPerfect ran into trouble when it did not move quickly into the Windows environment; second, the deathblow was Office's bundling of other productivity applications (PowerPoint, Excel, Access), which was simply a really smart move on Microsoft's part.
FIGURE 5.16 In 1978, WordStar appeared on the market, the first word processing software to become popular among computer owners, first on CP/M, then DOS, and then Windows. WordStar was slowly displaced in the mid-1980s by WordPerfect, which became the "standard" for DOS.
Chapter 5 The digital universe with DNA
Artificial intelligence-centric text editing

Whether you are looking for a needle in a haystack or revising your boss's memo, the process, tedious as it is, is the same. Fortunately, artificial intelligence, which is fast permeating every aspect of human life, is waking up the computing giants, who are including it in their products. Fig. 5.17 shows how AI is taking over many of our life activities, not only word processing but office assistance. Now let us talk about a different kind of editing: CRISPR. Genome editing, or gene editing, is a technology that gives scientists the ability to change an organism's DNA, and CRISPR is a genetic engineering editing tool. The CRISPR technology relies on two components: an enzyme and a guide molecule. First, the guide molecule (the intelligence unit) knows how to locate the troubled gene. The second component, the enzyme (called Cas9), is the soldier that goes to the "target" gene that needs to be modified. Once there, Cas9 cuts the gene, which can then be repaired in many ways: we can change the function of the gene, remove the gene completely, or make the gene more active. Fig. 5.18 shows the four steps in CRISPR dynamic editing. CRISPR-Cas9 was adapted from a naturally occurring genome editing system found in bacteria; it is like the human immune system. The CRISPR-Cas9 process is very simple: the bacterial cell captures snippets of DNA from an invading virus (let us call it a vaccine) and uses them to create DNA segments known as CRISPR arrays. CRISPR arrays allow the bacteria to "remember" the virus. If the virus attacks again, the bacteria produce RNA segments from the CRISPR arrays to target the virus's DNA.
FIGURE 5.17 Today, Microsoft, IBM, and Google are racing with AI to build an innovative replacement for the legacy word processor and make writing smarter. They are using supervised learning to check for repetitive words or incorrect grammar. It is still rudimentary technology, but give it a couple of years and you will have an AI assistant! The technology of document assistance, for professional writers, is moving in leaps and bounds. AI, artificial intelligence.
Step 1: Scientists create a genetic sequence called a "guide RNA" that matches the piece of DNA they want to modify.
Step 2: This sequence is added to a cell along with a protein called Cas9, which acts like a pair of scissors that cleave DNA.
Step 3: The guide RNA homes in on the target DNA sequence, and Cas9 cuts it out. Once their job is complete, the guide RNA and Cas9 leave the scene.
Step 4: Now, another piece of DNA is swapped into the place of the old DNA, and enzymes repair the cuts.
FIGURE 5.18
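At the level of sequence, the four steps above amount to string editing, which makes them easy to model in a few lines of code. This toy Python sketch (the function name `crispr_edit` and the sequences are illustrative, not from any real bioinformatics library) treats the genome as a string:

```python
def crispr_edit(genome, guide, replacement):
    """Toy model of CRISPR-Cas9 editing: locate the guide's target
    sequence in the genome, cut it out, and splice in a replacement."""
    site = genome.find(guide)     # Step 3: the guide homes in on the target
    if site == -1:
        return genome             # no matching target; genome unchanged
    # Steps 3-4: cut out the old segment and swap in the new DNA
    return genome[:site] + replacement + genome[site + len(guide):]

# Steps 1-2: design a guide matching the segment to modify and "add it to the cell"
genome = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
edited = crispr_edit(genome, guide="TTGTAATG", replacement="TTGCAATG")
print(edited)  # one base changed at the target site
```

Real CRISPR is, of course, biochemical machinery rather than string search, but the analogy to find-and-replace in a word processor is exactly the one this chapter draws.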
The bacteria then use Cas9 (the scissors) or a similar enzyme to cut the DNA apart, which disables the virus. Dr. Barrangou, an associate professor of food science at N.C. State University, aptly describes CRISPR as a "molecular scalpel to do molecular surgery." The RNA is supplemented with a short "guide" sequence (intelligence instructions for locating the target), which binds to the victim's DNA. The RNA works closely with the Cas9 enzyme: it carries the vaccine (the sample of the invader's DNA) to the target DNA and instructs the Cas9 enzyme (the plumber) to cut the DNA at the targeted location. This is the machinery used to add or delete pieces of genetic material, or to make changes to the DNA by replacing an existing segment with a customized DNA sequence. CRISPR-Cas9 is like a Swiss Army knife: faster, cheaper, more accurate, and more efficient than other existing genome-editing methods, and able to handle any update, deletion, or replacement of genetic material. We should call CRISPR the biological editor of genes. In addition, CRISPR is a big contributor to digital data storage (DDS) in DNA. We will be talking about that in the next three chapters.
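Since the coming chapters turn to digital data storage in DNA, it is worth seeing the core encoding idea in miniature: each base carries 2 bits, so any byte stream maps to a base sequence. A minimal Python sketch follows; the mapping A=00, C=01, G=10, T=11 is one common convention, not the only one used in practice, and real systems add addressing and error correction on top:

```python
BASES = "ACGT"  # A=00, C=01, G=10, T=11 (one common 2-bit convention)

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as 4 bases, 2 bits per base, most significant bits first."""
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))

def dna_to_bytes(seq: str) -> bytes:
    """Decode each run of 4 bases back into 1 byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | BASES.index(ch)
        out.append(b)
    return bytes(out)

dna = bytes_to_dna(b"Hi")
print(dna)  # "CAGACGGC": 'H' = 0x48 = 01 00 10 00 -> CAGA
assert dna_to_bytes(dna) == b"Hi"
```

The round trip is lossless, which is the whole point: once bits become bases, synthesis writes the archive and sequencing reads it back.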
Ethical concerns

Ethical concerns arise when gene editing, using technologies such as CRISPR-Cas9, is used to alter human genomes. Most of the changes introduced with genome editing are limited to somatic cells, which are cells other than egg and sperm cells. These changes affect only certain tissues and are not passed from one generation to the next. However, changes made to genes in egg or sperm cells (germline cells) or in the genes of an embryo could be passed to future generations. Germline cell and embryo genome editing bring up a number of ethical challenges, including whether it would be permissible to use this technology to enhance normal human traits (such as height or intelligence). Based on concerns about ethics and safety, germline cell and embryo genome editing are currently illegal in many countries.
CRISPR is the Holy Grail of data deluge

The biomedical community hails the magical cutting edge of gene therapy discovered by Drs. Jennifer Doudna and Emmanuelle Charpentier. A new, bright future has opened, showing that with the right coding, DNA's double helix could archive our civilization. This is, in fact, the main focus of this book: to learn how to design the next generation of storage. Just imagine: Catalog Technologies of Boston has stepped into the future. Hyunjun Park, the CEO of the company, said, "We are encoding the world's digital information into DNA, fitting entire data centers into the palm of your hand. We're using synthetic DNA that will change the way data is stored in the future." Surprisingly, all the big tech companies are jumping on the DNA bandwagon, captivating CIOs with the idea that millions of their CDs can be synthesized and stored on 1 g of DNA (see Fig. 5.19). It is a deep thought that scares the dickens out of them. DNA storage is an unimaginable, disruptive paradigm that will be adopted sooner or later.
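The "1 g of DNA" claim is easy to sanity-check with back-of-envelope arithmetic: at 2 bits per nucleotide and an average single-stranded nucleotide mass of roughly 330 g/mol (an approximation; real systems also spend capacity on addressing and error correction), one gram holds on the order of hundreds of exabytes:

```python
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 330.0    # approximate average mass of one ssDNA nucleotide
BITS_PER_NT = 2              # A/C/G/T encode 2 bits each (theoretical maximum)

nucleotides_per_gram = AVOGADRO / NT_MASS_G_PER_MOL   # ~1.8e21 nucleotides
bits_per_gram = nucleotides_per_gram * BITS_PER_NT
exabytes_per_gram = bits_per_gram / 8 / 1e18
print(f"~{exabytes_per_gram:.0f} EB per gram")        # ~456 EB per gram
```

Hundreds of exabytes per gram is the theoretical ceiling often quoted in the literature; practical densities are lower, but even a small fraction of this figure dwarfs any magnetic or optical medium.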
FIGURE 5.19 No storage technology company in the digital universe can hold a candle to the world's smallest hard drive: some 300 million DNA sequences, the equivalent of around 1000 CDs, synthesized in 2 h and stored in 1.5 mL.
Appendices

Appendix 5.A Glossary of data center terms (MERIT CyberSecurity engineering)

3D NAND: A next-generation type of nonvolatile memory technology (flash) that is becoming more mainstream and prevalent among enterprises. It has the advantage of being able to pack more bits into the same dimensions as older NAND technology. Most flash memory storage today is still planar, meaning it is two dimensional. However, as photolithography reaches its practical limits, squeezing more bits into each flash NAND cell becomes more difficult. Chip makers are going vertical with 3D NAND. Think of 3D NAND as a multistorey building and 2D NAND as a single-storey building. Both occupy the same amount of real estate (the X-Y dimension), but the multistorey building is more efficient within the same space because it expands upward.

Application layer: The layer (there are seven layers in the traditional Open Systems Interconnection (OSI) stack) that is closest to the end user in the conceptual model that characterizes the communication functions of a computing system. This means both the application layer and the user interact directly with the given software application being used. The application layer provides full end-user access to a variety of shared network services for efficient data flow. The application layer becomes increasingly important in virtualized and containerized environments, where it is abstracted from the physical infrastructure on which it runs. It also gives way to applications that program their own infrastructure needs. See Application-specific policies for more.

Application-specific policies: Policies (typically as they relate to infrastructure such as servers, storage, networks, and security in the modern data center) that are specifically tied to individual applications and the retrieval of data in a bare metal, cloud, container, or virtual machine environment.

Automation: A key concept in cloud computing.
Automation is what sets a cloud infrastructure apart from a virtualized infrastructure. It includes the ability to provision resources on demand without
the need for manual, human intervention. Automation is often combined with orchestration to create the ability of a service to be integrated with and fully support the many tools now available to IT to gain control and mastery of their operations. For example, a software-defined storage or software-defined networking solution is one that can easily plug into the automation and orchestration tools used in the rest of the data center without requiring customization or modification for a particular environment.

Cloud Foundry: An open-source cloud platform as a service (PaaS) originally created by VMware and now part of Pivotal Software. It is governed by the Cloud Foundry Foundation and is a PaaS on which developers can build, deploy, run, and scale applications in both public and private cloud environments. The platform leverages containers to deploy applications and enables businesses to take advantage of innovation from projects such as Docker and Kubernetes to increase the ease and velocity of managing production-grade applications.

Cluster: A networked collection of server computers that, in many respects, can be viewed as a single system. The term's meaning can vary depending on the context. However, in the context of the modern data center, a cluster is a group of servers and other resources that function as a single system, sometimes with elements of parallel processing. Many clusters are also distributed systems. See below for the related definition.

Container(s): Software technology giving a lightweight and portable method for packaging an application that provides isolation from an operating system (OS) and physical infrastructure. Unlike a virtual machine, containers do not include a full OS but instead share the OS of a host. Containers allow an application to be packaged and abstracted to simplify deployment among different platforms. Examples include Docker and Linux Containers (LXC). Containers are often associated with microservices, defined below.
"Container" may also refer to a granular unit of data storage. For example, Amazon S3 (Simple Storage Service) uses the term "bucket" to describe a data container. In certain software-defined storage (SDS) solutions, the data that make up virtual disks are stored in logical containers housed on various nodes in a cluster.

Control plane: Originally a networking term, the control plane is generally anything related to the "signaling" of the network. Control plane packets are destined to, or locally originated by, the router itself. The control plane makes decisions about where traffic is sent, and its functions include system configuration, management, and exchange of routing table information. However, with the rise of software-defined infrastructure, "control plane" is now a term that extends to server, storage, and security infrastructure. It refers to the programmable set of application programming interfaces (APIs) that govern the configuration, management, and monitoring of the infrastructure.

Data layer: A term with a number of definitions (including application as a marketing buzzword). However, in the context of a modern data center, the data layer is a data structure that holds all data that need to be processed and passed within a digital context (as in a website, for example) to other linked applications.

Data plane: Also known as the forwarding plane, it forwards traffic to the next hop along the path to a selected destination network according to the control plane logic (the learned path that the data on the data plane take). Also originally a networking term, the data plane consists of data (packets) that are sent through the router itself on the way to their next destination. Increasingly, "data plane" refers to the infrastructure that stores, manages, protects, and transmits data for all applications.

Distributed system: A cluster of autonomous computers networked together to create a single unified system.
In a distributed system, networked computers coordinate activities and share resources to support a common workload. The goals of distributed systems are to maximize performance and scalability, ensure fault tolerance, and enable resource availability. Examples of distributed systems include Amazon Dynamo, Google MapReduce, and Apache Hadoop.
Docker: An open-source project that automates the deployment of applications within software containers. Docker containers, like other containers, wrap a piece of software in a complete file system containing everything needed to run: code, runtime, system tools, system libraries, and so forth. Docker is often synonymous with containers, and many use the terms interchangeably. It is important to note that Docker is both an open-source set of tools and a company, which supports the open-source technology as well as selling its own proprietary software.

DRaaS: Disaster recovery as a service (DRaaS) is the replication and hosting of physical or virtual infrastructure by a dedicated provider to enable failover in the event of a human-made or natural catastrophe. DRaaS is one of the primary drivers of cloud computing and often the primary motivation behind adopting a hybrid or multicloud architecture.

Flash: A storage device that stores persistent data on nonvolatile solid-state memory chips. Unlike spinning electromechanical disks (i.e., hard disk drives), flash drives have no moving parts. Flash also typically produces no noise, stores and accesses data more quickly, has less latency, and is more reliable and durable than spinning media. Since the technology is more advanced, the cost of flash is usually higher, although it is decreasing as production methods are refined, improved, and scaled.

Hybrid cloud: A cloud computing environment in which private cloud resources (e.g., an on-premises data center) are managed and utilized together with resources provisioned in a public cloud (e.g., Amazon Web Services). Typically, applications and data are exchanged across this private/public cloud boundary, creating a single logical infrastructure or set of services.

Hyperconverged: An architecture that combines software-defined computing and software-defined storage together on a commodity server to form a simplified, scale-out data center building block.
The "hyper" in hyperconvergence comes from hypervisor, the server virtualization component of the solution.

Hyperscale: An architecture where software-defined computing and software-defined storage scale independently of one another. A hyperscale architecture is well suited for achieving elasticity because it decouples storage capacity from computing capacity. Hyperscale architectures underpin Web giants, including Google and Amazon, and are increasingly being adopted by other enterprises as a means to efficiently scale or contract an environment over time.

IaaS: Infrastructure as a service (IaaS) is a form of cloud computing in which virtualized computing resources are provided over the Internet. It is considered one of the three main categories of cloud computing, along with software as a service (SaaS) and platform as a service (PaaS). These computing resources are typically billed on a utility computing basis (pay as you go; pay as much as you use). IaaS is a service model that delivers virtualized infrastructure on an outsourced basis to support organizations. Among its benefits are automated administrative costs, self-serviceability, dynamic scaling, flexibility, and platform virtualization.

Kubernetes: Another popular open-source system for automating the deployment, scaling, and management of containerized applications. Originally designed by Google, it was donated to the Cloud Native Computing Foundation. Kubernetes defines a set of building blocks that collectively provide the mechanisms for deploying, maintaining, and scaling applications. Kubernetes is also designed to be loosely coupled and extensible so it can accommodate a wide range of workloads.

Mesos: Formally known as Apache Mesos, an open-source software project to manage computer clusters that was originally developed at the University of California, Berkeley.
Apache Mesos abstracts CPU, memory, storage, and other computer resources away from machines (be they physical or virtual) and allows for fault-tolerant and elastic distributed systems to be built and run easily and
effectively. It sits between the application layer and the operating system and eases deploying and managing applications in large-scale clustered environments. It was originally designed to manage large-scale Hadoop environments but has since been extended to manage other types of clusters.

Microservices: A method of developing software applications as a suite of independently deployable, small, modular services in which each service runs a unique process and communicates through a well-defined, lightweight mechanism. The idea behind microservices is that some applications are easier to build and maintain when they are broken down into smaller, composable elements. When the different components of an application are separated, they can be developed concurrently. Another advantage of microservices is resilience: components can be spread across multiple servers or data centers, and if a component dies, one only needs to spin up another component elsewhere, and the overall application continues to function. Microservices are similar to, but differ from, a service-oriented architecture (SOA) in that each service can be independently operated and deployed. The rise in microservices' popularity is tied to the emergence of containers as a way of packaging and running the code.

Multicloud: The use of two or more public cloud computing service providers by a single organization. Hybrid clouds can be multiclouds if two or more public clouds are used in conjunction with a private cloud. Multicloud environments minimize the risk of data loss or downtime due to a failure occurring in hardware, infrastructure, or software at the public cloud provider. A multicloud approach can also be used as part of a pricing strategy to keep costs under control and prevent vendor lock-in to one cloud provider. This method can increase flexibility by mixing and matching best-in-class technologies, solutions, and services from different public cloud providers.
Multisite replication: The ability to natively replicate data among different sites to ensure locality and availability. A site can represent a private cloud data center, public cloud data center, remote office, or branch office. Multisite replication prevents any one site from being a single point of failure.

Multitier: A type of application that is developed and distributed among more than one layer and logically separates the different application-specific, operational layers. The number of layers varies by business and application requirements, but three-tier is the most commonly used. The three tiers are: presentation (user interface); application (core business or application logic); and data (where the data are managed). Also known as N-tier application architecture, it provides a model in which developers can create flexible and reusable applications. Multitier can also refer to data storage: a single storage platform that spans multiple, traditional tiers of storage. In this case, each tier is defined by the specific performance and availability needs of applications. Tier 0 or 1 often serves the highest-performance, highest-availability applications (often on all-flash arrays), whereas tier 3 or 4 often serves the lowest-performance, lowest-availability applications (often on archive or cold archive storage).

Multiworkload: A distributed computing environment in which different workloads (all of which may have differing characteristics) are equally supported, managed, and executed. Just as there are different types of bicycles for different uses, different computing workloads place different demands on the underlying infrastructure, whether it be a desktop workload or a Systeme, Anwendungen und Produkte in der Datenverarbeitung (SAP) system workload.
Different workloads have different characteristics in terms of computing capacity, network needs, data storage, backup services, security needs, network bandwidth needs, and QoS metrics, among other factors. Multiworkload is gaining prominence as companies look to build cloud environments where a single, shared infrastructure supports all the workload or application needs. This is in sharp contrast to traditional, siloed environments, where workloads often have bespoke infrastructures. In a multiworkload cloud, software-defined technologies and application-specific policies enable a single infrastructure to meet the needs of a diverse set of applications.
Node: A widely used term in IT that may refer to devices or data points on a larger network. Devices such as a personal computer, cell phone, or printer are considered nodes. Within the context of the Internet, a node is anything that has an IP address. When used in the context of modern data centers, it can refer to a server computer, often one of the different computers that make up a cluster or multiworkload environment.

OpenStack: A free and open-source software platform for cloud computing, deployed mostly to underpin private or public cloud infrastructure as a service (IaaS). The software platform consists of interrelated components that control diverse, multivendor hardware pools of processing, storage, and networking resources throughout a data center. Users manage it via a Web-based dashboard, command-line tools, or a RESTful API.

Orchestration layer: Programming that manages the interconnections and interactions among cloud-based and on-premises components. In this layer, tasks are combined into workflows so the provisioning and management of various IT components and associated resources can be automated with tools or managers such as Puppet, Chef, Ansible, Salt, and Jenkins, among others. Traditional data center infrastructure management tools such as VMware vSphere, Microsoft Hyper-V, and OpenStack are also considered part of the orchestration layer.

PaaS: An application platform, a category of cloud computing services providing a platform that allows customers to develop, run, and manage applications without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app. There are different types of PaaS, including public, private, and hybrid. Originally intended for applications on public cloud services, PaaS has expanded to include private and hybrid options.
PCIe: An abbreviation of Peripheral Component Interconnect Express, it is a serial expansion bus standard for connecting a computer to one or more peripheral devices. With PCIe, data center managers can take advantage of high-speed networking across server backplanes and connect to Gigabit Ethernet, RAID, and InfiniBand networking technologies outside of the server rack. It offers lower latency and higher data transfer rates than parallel buses such as PCI and PCI-X.

Private cloud: A type of cloud computing designed to deliver advantages similar to those of the public cloud (including scalability, flexibility, and self-service) but dedicated to a single organization. A large multinational enterprise, for example, may establish its own private cloud that mimics the characteristics of those offered by public cloud providers, which deliver services to multiple companies concurrently. Private clouds can be deployed in wholly owned data center facilities or hosted in outsourced facilities. Thus, private cloud does not necessarily mean on-premises, although most are deployed as such.

Scale-out: Used to describe a type of architecture that may apply to storage, networking, or applications. In general, scale-out refers to adding more components in parallel to spread out a workload. In most cases, scale-out adds more controllers with each node added to the scale-out system. This enables higher levels of scalability, performance, and resiliency. This contrasts with scale-up, which refers to adding more capacity to the system without adding more controllers. Most scale-up systems are a dual-controller model and represent scaling, performance, and resiliency limits based on this constraint.

Software-defined system: An increasingly widely used term in storage, networking, and other IT applications, it generally refers to a new class of products where the software is deployed on commodity hardware to provide a capability.
Traditional, or hardware-defined, systems have a tight coupling of software to proprietary hardware components or designs. Software-defined abstracts physical resources, automates actions, and enables the programming of infrastructure to meet specific application and workload needs.
Stretch(ed) clusters: A deployment model in which two or more virtualization host servers are part of the same logical cluster but are in separate geographical locations. In stretched clusters, the servers act as a single system to provide high availability and load balancing despite not being in the same facility. They have the advantage of enabling easier migration of virtual machines from one physical location to another while maintaining network connections with the other servers in the cluster.

Tier(s): Can refer to multitiered architecture (as defined earlier) but, in the context of storage, determines the priority and importance of an organization's data. Tier 1 data, for example, are the data that an organization or computing environment must have immediate access to for the most mission-critical applications. Tier 2 data most often include business-critical application data, and the type of storage will depend on performance and availability requirements. Tier 3 data typically refer to backups, whereas archived data are typically tier 4 or higher.

UDP: Conceived by Hedvig, the universal data plane is a single, programmable data management layer spanning workloads, clouds, and tiers that capitalizes on the distributed systems approach being adopted by many organizations. It is a virtualized abstraction layer that enables any workload to store and protect its data across any location. It also dramatically simplifies operations by plugging into modern orchestration and automation frameworks such as Docker, Kubernetes, Mesos, Microsoft, OpenStack, and VMware.
Appendix 5.B Glossary for big data (prepared by MERIT CyberSecurity Engineering)

With billions of bytes of data being collected daily, it is more important than ever to understand the intricacies of Big Data. In an effort to help bring clarity to this field, we compiled from our recent Big Data guides a list of what we feel are the most important related terms and definitions you need to know. (By the way, if you are interested in this, you might also be interested in our AI glossary!)

Algorithm: A set of rules given to an AI, neural network, or other machine to help it learn on its own; classification, clustering, recommendation, and regression are four of the most popular types.

Apache Flink: An open-source streaming data processing framework. It is written in Java and Scala and is used as a distributed streaming dataflow engine.

Apache Hadoop: An open-source tool to process and store large distributed data sets across machines by using MapReduce.

Apache Kafka: A distributed streaming platform that improves upon traditional message brokers through improved throughput, built-in partitioning, replication, latency, and reliability.

Apache NiFi: An open-source Java server that enables the automation of data flows between systems in an extensible, pluggable, open manner. NiagaraFiles (NiFi) was open-sourced by the National Security Agency (NSA).

Apache Spark: An open-source Big Data processing engine that runs on top of Apache Hadoop, Mesos, or the cloud.

Artificial intelligence: A machine's ability to make decisions and perform tasks that simulate human intelligence and behavior.

Big Data: A common term for large amounts of data. To be qualified as Big Data, data must be coming into the system at a high velocity, with large variation, or at high volumes.

Blob storage: An Azure service that stores unstructured data in the cloud as a blob or an object.
Business intelligence: The process of visualizing and analyzing business data for the purpose of making actionable and informed decisions.

Cluster: A subset of data that share particular characteristics. Can also refer to several machines that work together to solve a single problem.

CoAP: Constrained Application Protocol, an Internet application protocol for limited-resource devices that can be translated to HTTP if needed.

Data engineering: The collection, storage, and processing of data so that it can be queried by a data scientist.

Data flow management: The specialized process of ingesting raw device data while managing the flow of thousands of producers and consumers, then performing basic data enrichment, in-stream analysis, aggregation, splitting, schema translation, format conversion, and other initial steps to prepare the data for further business processing.

Data governance: The process of managing the availability, usability, integrity, and security of data within a data lake.

Data integration: The process of combining data from different sources and providing a unified view for the user.

Data lake: A storage repository that holds raw data in its native format.

Data mining: A practice to generate new information through the process of examining and analyzing large databases.

Data operationalization: The process of strictly defining variables into measurable factors.

Data preparation: The process of collecting, cleaning, and consolidating data into one file or data table, primarily for use in analysis.

Data processing: The process of retrieving, transforming, analyzing, or classifying information by a machine.

Data science: A field that explores repeatable processes and methods to derive insights from data.

Data swamp: What a data lake becomes without proper governance.

Data validation: The act of examining data sets to ensure that all data are clean, correct, and useful before they are processed.
Data warehouse: A large collection of data from various sources used to help companies make informed decisions.

Device layer: The entire range of sensors, actuators, smartphones, gateways, and industrial equipment that send data streams corresponding to their environment and performance characteristics.

GPU-accelerated databases: Databases that use graphics processing units (GPUs) to accelerate the ingestion and analysis of streaming data.

Graph analytics: A way to organize and visualize relationships between different data points in a set.

Hadoop: A programming framework for processing and storing Big Data, particularly in distributed computing environments.

Ingestion: The intake of streaming data from any number of different sources.

MapReduce: A data processing model that filters and sorts data in the Map stage and then performs a function on that data and returns an output in the Reduce stage.

Munging: The process of manually converting or mapping data from one raw form into another format for more convenient consumption.

Normal distribution: A common graph representing the probability of a large number of random variables, where those variables approach normalcy as the data set increases in size. Also called a Gaussian distribution or bell curve.
Chapter 5 The digital universe with DNA
Normalizing: The process of organizing data into tables so that the results of using the database are always unambiguous and as intended.
Parse: To divide data, such as a string, into smaller parts for analysis.
Persistent storage: A nonchanging place, such as a disk, where data are saved after the process that created them has ended.
Python: A general-purpose programming language that emphasizes code readability, allowing programmers to use fewer lines of code to express their concepts.
R: An open-source language primarily used for data visualization and predictive analytics.
Real-time stream processing: A model for analyzing sequences of data by using machines in parallel, though with reduced functionality.
Relational database management system (RDBMS): A system that manages, captures, and analyzes data that are grouped based on shared attributes called relations.
Resilient distributed data set: The primary way that Apache Spark abstracts data, where data are stored across multiple machines in a fault-tolerant way.
Shard: An individual partition of a database.
Smart data: Digital information that is formatted so it can be acted upon at the collection point before being sent to a downstream analytics platform for further data consolidation and analytics.
Stream processing: The real-time processing of data. The data are processed continuously, concurrently, and record by record.
Structured data: Information with a high degree of organization.
Taxonomy: The classification of data according to a predetermined system, with the resulting catalog used to provide a conceptual framework for easy access and retrieval.
Telemetry: The remote acquisition of information about an object (for example, from an automobile, smartphone, medical device, or IoT device).
Transformation: The conversion of data from one format to another.
Unstructured data: Data that either do not have a predefined data model or are not organized in a predefined manner.
Visualization: The process of analyzing data and expressing it in a readable, graphical format, such as a chart or graph.
Zones: Distinct areas within a data lake that serve specific, well-defined purposes.
Appendix 5.C Glossary of CRISPR (MERIT Cybersecurity knowledge base)

Acquisition: Process by which a new spacer is integrated into the CRISPR locus. The ability of CRISPR loci to acquire novel spacers derived from invasive nucleic acids drives the adaptive nature of this immune system.
Cas: CRISPR-associated gene. These Cas genes encode a functionally diverse set of Cas proteins that are directly involved in one or more stages of the CRISPR mechanism of action: spacer acquisition, CRISPR locus expression, and target interference. These genes often reside in operons and are typically genetically linked with CRISPR repeat-spacer arrays.
Cascade: CRISPR-associated complex for antiviral defense. Multisubunit Cas protein complex required for crRNA processing and maturation, and CRISPR-mediated interference. Homologs of the archetype Cascade complex from Escherichia coli are found in all type I CRISPR-Cas subtypes. The other CRISPR (sub)types have distinct ribonucleoprotein complexes (type II, a multidomain protein Cas9; type III, a multisubunit RAMP complex).
CRISPR: Clustered regularly interspaced short palindromic repeats. Genetic loci that contain arrays of homologous direct DNA repeats separated by short variable sequences called spacers, although not all repeat sequences are actually palindromic. This hallmark of CRISPR-Cas systems encodes the CRISPR transcript, which is processed into small interfering RNAs.
CRISPR-Cas: Immune system comprising a CRISPR repeat-spacer array and accompanying Cas genes. There are generally three distinct types of CRISPR-Cas systems, namely type I, type II, and type III, as defined by the content and sequences of their elements, notably Cas genes.
crRNA: CRISPR RNA. Mature small, noncoding CRISPR RNA generated by cleavage and processing of a precursor CRISPR transcript (pre-crRNA), which guides the Cas interference machinery toward homologous invading nucleic acids.
Interference: Process by which invasive DNA or RNA is targeted by crRNA-loaded Cas proteins. This process relies on sequence homology and complementarity between the crRNA and the target nucleic acid. In some cases, ancillary elements are necessary for target interference, such as tracrRNA and PAMs, and for preventing autoimmunity.
Leader: AT-rich sequence located upstream of the first CRISPR repeat. This sequence serves as a promoter for the transcription of the repeat-spacer array. Also, this sequence defines CRISPR locus orientation for transcription and polarized spacer acquisition.
PAM: Protospacer adjacent motif. Short signature sequences (typically 2-5 nucleotides (nt)) flanking a protospacer, which are necessary for the interference step in most type I and type II DNA-targeting systems.
Pre-crRNA: Pre-CRISPR RNA. Full-length transcript generated by the CRISPR repeat-spacer array, which serves as the precursor for crRNA biogenesis via one or more processing and maturation steps.
Protospacer: A spacer precursor in invasive nucleic acid. Precursor sequence of CRISPR spacers in the DNA of invasive elements that will be sampled by the CRISPR-Cas immune system as part of the acquisition process and subsequently targeted by crRNA as part of the interference process.
RAMP: Repeat-associated mysterious proteins. Subset of Cas proteins with an RNA recognition motif (RRM fold). Some RAMPs have endoribonuclease activity involved in the maturation and processing of crRNA.
Repeats: Short sequence repeated within a CRISPR array, separated by spacers. Highly similar sequences that form direct repeats spaced by short variable sequences of conserved length. The CRISPR repeat sequence is critical for crRNA maturation and processing and is functionally coupled with Cas genes to form a functional CRISPR-Cas system. Only a subset of all repeat types are palindromic.
R-loop: Section of DNA associated with RNA forming a loop. Structure in which RNA hybridizes with double-stranded DNA. The RNA base pairs with a complementary sequence in one of the strands of a DNA molecule, causing the displaced strand to form a loop.
RNAi: RNA interference. Process in eukaryotes by which small noncoding RNA molecules guide enzymatic cleavage of complementary mRNAs, through the RNA-induced silencing complex (RISC).
Seed sequence: Short sequence within the crRNA that requires perfect base pairing with the target sequence. Short stretch of nucleotides (7-9 nt) which enthalpically drives hybridization between the interfering crRNA and the complementary target strand in the vicinity of the PAM site, which supports R-loop formation and generally results in interference.
Appendix 5.D A list of features standard in current word processing programs

Adjustment: Realignment of text to new margin and tab settings.
Alignment: Positioning text or numbers to specified margin and tab settings.
Automatic spelling checker and corrector: Program that compares words in the text against an online dictionary, flagging items not found in the dictionary and offering alternative spellings and a means of correcting the errors.
Boilerplate: The separate storage and retrieval of blocks of text from which standard documents can be built.
Centering: Moving text on a line so that it is centered on the line.
Copying or cutting: The duplication or moving of blocks of text within a document.
Decimal alignment: Positioning columns of numbers with the decimal points vertically aligned.
Deletion: Erasure of text from the screen, or of whole documents from the disk.
Discretionary hyphenation: Option of inserting a hyphen to break a word that ends a line; the hyphen does not print if later editing moves the word to the middle of a line.
Footnoting: Automatic sequential numbering of footnotes and positioning of the footnotes at the bottom of their appropriate pages during pagination.
Form letter merging: Automatic combining of a form letter with a mailing list to generate multiple copies of the letter with the different addresses and other variable information filled in.
Headers and footers: Option of creating standard blocks of text that will automatically appear at the top or bottom of each page in a document.
Indents: The setting of temporary margins within a document differing from the primary margins used.
Insertion: The entry of new text within previously typed material without erasing the existing material.
Justification: Automatic alignment of text to both the left and right margins.
Overstriking: The substitution of new text for old by typing over the old text.
Page numbering: Automatic sequential numbering of pages.
Pagination: Automatic division of a document into pages of specified numbers of lines.
Search and replace: Moving directly to specified words or parts of words within a document and replacing them with different words or word portions.
Table of contents and index generators: Programs that create these based on the text of a document.
Word wrap: Automatic arrangement of text in lines of specified length without the necessity of touching the return key.
Chapter 6
Getting DNA storage on board: starting with data encoding
Some mathematical ideas that we all need to know

"Come forth into the light of things, / Let Nature be your teacher."
- William Wordsworth, "The Tables Turned" (1798)
"π is the most famous number in mathematics. Forget all the other constants of nature; π will always come at the top of the list. If there were Oscars for numbers, π would get an award every year."
- on Archimedes' constant, π
It was Archimedes of Syracuse who made a real start on the mathematical theory of π in around 225 BCE. The ratio of the circumference of a circle to its diameter was a subject of ancient interest: around 2000 BCE, the Babylonians observed that the circumference is roughly 3 times as long as the diameter. What does Julius Caesar have in common with the transmission of modern digital signals? The short answer is codes and coding. To send digital signals to a computer or a digital television set, the coding of pictures and speech into a stream of zeros and ones (a binary code) is essential, for it is the only language these devices understand. Caesar used codes to communicate with his generals and kept his messages secret by shifting the letters of each message according to a key that only he and they knew. Accuracy was essential for Caesar, and it is equally required for the effective transmission of digital signals. Caesar also wanted to keep his codes to himself, as do the cable and satellite broadcasting television companies, who want only paying subscribers to be able to make sense of their signals.
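Caesar's letter-shifting scheme is easy to state in code. A minimal sketch, where the shift value plays the role of the key that only Caesar and his generals knew:

```python
def caesar(text, key):
    """Shift each letter of `text` forward by `key` places in the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return "".join(out)

secret = caesar("ATTACK AT DAWN", 3)   # -> "DWWDFN DW GDZQ"
plain = caesar(secret, -3)             # the recipient applies the key in reverse
```

Applying the cipher with the negated key recovers the plaintext, which is exactly why the key had to stay secret.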
And DNA does that! Imagine that someone gives you a mystery novel with an entire page ripped out. And suppose someone else comes up with a computer program that reconstructs the missing page by assembling sentences and paragraphs lifted from other places in the book. Imagine that this computer program does such a beautiful job that most people cannot tell the page was ever missing.

Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00006-9 Copyright © 2020 Elsevier Inc. All rights reserved.
The plants could reconstruct the damaged section. They did so by copying other parts of the DNA strand, then pasting them into the damaged area. This discovery by the eminent cytogeneticist Dr. Barbara McClintock, made in 1940 at Cornell University and the University of Missouri, was so radical at the time that hardly anyone believed her reports. Forty years later, she won the Nobel Prize for this work.
And we still wonder: how does a tiny cell possibly know how to do that? Dr. Jean-Claude Perez, French HIV researcher and computer scientist, has found part of the answer. He said, "The instructions in DNA are not only linguistic, they're beautifully mathematical. There is an evolutionary matrix that governs the structure of DNA." Computers use something called a "checksum" to detect data errors. It turns out that DNA uses checksums too. But DNA's checksum is not only able to detect missing data; sometimes, it can even calculate what is missing. In DNA, some letters appear a lot more often (like E in English) and some much less often. But, unlike English, how often each letter appears in DNA is controlled by an exact mathematical formula hidden within the genetic code table. When cells replicate, they count the total number of letters in the DNA strand of the daughter cell. If the letter counts do not match certain exact ratios, the cell knows that an error has been made, so it abandons the operation and kills the new cell. Failure of this checksum mechanism causes birth defects and cancer.
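To make the computing analogy concrete, here is a minimal checksum sketch. It illustrates the idea of redundant summary information that can detect, and in this simple case even reconstruct, a missing value; it models the computing concept only, not any actual cellular mechanism.

```python
def checksum(values):
    # A toy modular checksum: the sum of the values, mod 256.
    return sum(values) % 256

data = [7, 42, 13, 99]
stored = checksum(data)   # kept alongside the data

# One value is later lost (None). The stored checksum both reveals the
# damage and lets us reconstruct the missing value.
damaged = [7, None, 13, 99]
partial = sum(v for v in damaged if v is not None)
missing = (stored - partial) % 256   # -> 42
```

Real checksums such as CRC32 are more elaborate, but the principle is the same.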
Data nomenclature
Before covering encoding itself, we define several terms that recur throughout the discussion below.

Address: Each data block is augmented with addressing information to identify its location in the input data string. The address space is in two parts: the high part of the address identifies the key a block is associated with, and the low part indexes the block within the value associated with that key. The combined address is padded to a fixed length and converted to nucleotides, and a parity nucleotide is added for basic error detection.

Denaturation: A process in which proteins or nucleic acids lose the quaternary, tertiary, and secondary structure present in their native state, through application of some external stress or compound such as a strong acid or base, a concentrated inorganic salt, an organic solvent (e.g., alcohol or chloroform), radiation, or heat. (Extracted from MERIT CyberSecurity archives.)

Polymerase chain reaction (PCR): A method widely used in molecular biology to make multiple copies of a specific DNA segment. Using PCR, a single copy of a DNA sequence is exponentially amplified to generate thousands to millions of copies of that segment.

Another practical issue with representing data in DNA is that current synthesis technology does not scale beyond sequences of low hundreds of nucleotides. Data beyond the hundreds of bits therefore cannot be synthesized as a single strand of DNA. In addition, DNA pools do not offer spatial isolation, so a pool contains data for many different keys that are irrelevant to a single read operation. Isolating only the molecules of interest is nontrivial, and so existing DNA storage techniques generally sequence the entire solution, which incurs significant cost and time overheads. To overcome these two constraints, data in DNA were laid out similarly to the Goldman encoding, as shown in Fig. 6.1. Segmenting the nucleotide representation into blocks, which we synthesize as separate strands, allows storage of large values. Tagging those strands with identifying primers allows the read process to isolate molecules of interest and so perform random access.

FIGURE 6.1 An overview of the DNA data encoding format. After translating to nucleotides, the stream is divided into strands. Each strand contains a payload from the stream, together with addressing information to identify the strand and primer targets necessary for PCR and sequencing.

Payload: The string of nucleotides (base pairs, the DNA building blocks) representing the data to be stored is broken into data blocks, whose length depends on the desired strand length and the additional overheads of the format. To aid in decoding, two sense nucleotides indicate whether the strand has been reverse complemented (this is done to avoid certain pathological cases).

Primers: To each end of the strand, we attach primer sequences. These sequences serve as a "foothold" for the PCR process and allow the PCR to selectively amplify only those strands with a chosen primer sequence.

Random access: We exploit primer sequences to provide random access: by assigning different primers to different strands, we can perform sequencing on only a selected group of strands. Existing work on DNA storage uses a single primer sequence for all strands.
While this design suffices for data recovery, it is inefficient: the entire pool (i.e., the strands for every key) must be sequenced to recover one value.
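The pieces defined above (payload blocks, a two-part address, a parity nucleotide, and flanking primers) can be sketched in code. The field widths, the primer sequence, and the 2-bits-per-base mapping used here are illustrative assumptions, not the exact format of Fig. 6.1:

```python
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def to_nucleotides(bits):
    # `bits` length must be even; every 2 bits become one base.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def parity_base(strand):
    # One parity nucleotide for basic error detection: XOR of the 2-bit
    # values of all bases (an illustrative parity choice).
    base_to_val = {v: int(k, 2) for k, v in BITS_TO_BASE.items()}
    acc = 0
    for base in strand:
        acc ^= base_to_val[base]
    return to_nucleotides(format(acc, "02b"))

def make_strand(key_id, block_index, payload_bits, primer="ACGTACGT"):
    # High address part identifies the key; low part indexes the block
    # within the key's value. Both are padded to fixed 8-bit widths here.
    address = to_nucleotides(format(key_id, "08b") + format(block_index, "08b"))
    body = address + to_nucleotides(payload_bits)
    return primer + body + parity_base(body) + primer

strand = make_strand(key_id=3, block_index=0, payload_bits="0110001101110010")
```

With 16 address bits, a 16-bit payload, one parity base, and two 8-base primers, the sketch emits a 33-base strand; a real design would also carry the sense nucleotides and much longer payloads.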
The random access method

To provide random access, we map the keys to unique primer sequences. All strands for a particular object share a common primer, and different strands with the same primer are distinguished by their different addresses. Primers allow random access via PCR, which produces many copies of a piece of DNA in a solution. By controlling the sequences used as primers for PCR, we can dictate which strands in the solution are amplified. To read a particular key's value from the solution, we simply perform a PCR process using that key's primer, which amplifies the selected strands. The sequencing process then reads only those strands, rather than the entire pool. The amplification means sequencing can be faster and cheaper, because the probability of recovering the desired object is higher. Note that not all adapters and primers behave equally well during PCR, and the actual sequences affect the PCR cycle temperatures; adapter and primer design is outside the scope of this chapter. The hash function that maps addresses to primers can be implemented as a table lookup of primers that are known to work well and have known thermocycling temperatures.
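In software terms, the scheme behaves like a table lookup followed by a filter: the key selects a primer, and amplification keeps only the strands flanked by that primer. A toy model, with made-up keys and primer sequences:

```python
# Hypothetical lookup table of primers known to work well (not real sequences).
PRIMER_FOR_KEY = {"cat.jpg": "AATTCCGG", "dog.jpg": "GGCCAATT"}

def pcr_select(pool, primer):
    """Mimic PCR selection: keep only strands flanked by the chosen primer."""
    return [s for s in pool if s.startswith(primer) and s.endswith(primer)]

pool = [
    "AATTCCGG" + "ACGT" + "AATTCCGG",   # a strand belonging to cat.jpg
    "GGCCAATT" + "TTGA" + "GGCCAATT",   # a strand belonging to dog.jpg
]
selected = pcr_select(pool, PRIMER_FOR_KEY["cat.jpg"])
```

Only the strands carrying the requested key's primer survive the selection, which is the software analog of sequencing just the amplified subset rather than the whole pool.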
Other existing encoding methods

There are several encoding methods that have been renovated and modernized. However, the encoding is still naïve: each bit of binary data is encoded in exactly one location in the output DNA strands, so durability relies entirely on the robustness of the DNA itself. Below we also discuss a more robust design that provides redundancy at the data encoding level.
Bancroft encoding method

Early work in DNA storage used encodings simpler than the one described above. For example, Dr. Bancroft translated text to DNA by means of a simple ternary encoding: each of the 26 English characters plus a space character maps to a sequence of three nucleotides drawn from A, C, and T (so exactly 3^3 = 27 characters can be represented). The authors successfully recovered a message of 106 characters, but this encoding suffers substantial overheads and poor reliability for longer messages.
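A sketch of such a ternary encoding; the particular assignment of characters to base-3 digits over {A, C, T} is an assumption for illustration:

```python
import string

# 26 letters plus space: 27 symbols, exactly the 3^3 codes over {A, C, T}.
ALPHABET = string.ascii_uppercase + " "
BASES = "ACT"

def encode_char(ch):
    n = ALPHABET.index(ch)                    # 0..26
    digits = (n // 9, (n // 3) % 3, n % 3)    # base-3 representation
    return "".join(BASES[d] for d in digits)

def encode(text):
    return "".join(encode_char(ch) for ch in text.upper())

def decode(dna):
    out = []
    for i in range(0, len(dna), 3):
        d = [BASES.index(b) for b in dna[i:i + 3]]
        out.append(ALPHABET[d[0] * 9 + d[1] * 3 + d[2]])
    return "".join(out)
```

Each character costs three bases regardless of its frequency, which is part of the overhead the passage mentions.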
The Huffman encoding method

Dr. David Huffman, then at MIT, developed a lossless data compression algorithm (no information is lost during encoding) that works by limiting the number of bits required by a character. This type of algorithm is very useful in fields such as bioinformatics, which deal with a small number of letters but a large amount of data that needs to be compressed. The idea of Huffman coding is to assign variable-length codes to input characters, where the lengths of the assigned codes are based on the frequencies of the corresponding characters: the most frequent character gets the smallest code and the least frequent character gets the largest code. Fig. 6.2 shows how Huffman coding works. As an example, take a vector containing binary values such as

01100011 01110010 01111001 01110000 01110100 01101111

We could then use MATLAB to map it to a DNA sequence.
FIGURE 6.2 To go from plain text to compressed text, you would have to do a traversal of the tree and store the path to reach each leaf node as a string of bits (0 for going left, 1 for going right) and associate that bit string with the particular character at the leaf. Once this is done, converting a plain text file into a compressed file is just a matter of replacing each letter with the appropriate bit string.
Essentially, a tree is built from the bottom up. We start out with 256 trees (for an ASCII file) and end up with a single tree with 256 leaves along with 255 internal nodes (one for each merging of two trees, which takes place 255 times). The tree has a few interesting properties: the frequencies of all of the internal nodes combined give the total number of bits needed to write the encoded file (except the header). This property comes from the fact that at each internal node a decision must be made to go left or right, and each internal node is reached once for each time a character beneath it shows up in the text of the document. The Huffman encoding algorithm is an optimal compression algorithm when only the frequency of individual letters is used to compress the data. The idea behind the algorithm is that if some letters are more frequent than others, it makes sense to use fewer bits to encode those letters than to encode the less frequent ones. There are only 52 letters in the alphabet (counting lowercase and capital), 10 digits (0-9), and some small set of punctuation that is likely to be used, so the real number of characters used in a typical plain text file is closer to 75 than to 256 (for 8-bit encoding) or 65,536 (for 16-bit encoding). We developed a small program to show how Huffman compression works, as shown in Fig. 6.3.
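The bottom-up tree construction described above can be sketched with a priority queue. This is a generic Huffman implementation for illustration, not the program shown in Fig. 6.3:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table: frequent characters get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreak, tree); a tree is either a
    # character (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:  # degenerate single-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:  # repeatedly merge the two lightest trees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):  # internal node: 0 = left, 1 = right
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            codes[node] = path
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("go go gopher")
encoded = "".join(codes[c] for c in "go go gopher")
```

The resulting codes are prefix-free, so the encoded bit stream can be decoded unambiguously by walking the tree.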
FIGURE 6.3 A simple program written in Python to demonstrate how we can take a binary string and convert it to DNA code. If we had a book with 200 pages, its binary representation would run to well over 80,000 binary digits. Condensing the string, as we do when we zip a file, is recommended, provided the compression is lossless (no information lost).
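The conversion the figure describes can be approximated as follows. The two-bits-per-base mapping (00 to A, 01 to C, 10 to G, 11 to T) is one common convention and is assumed here; it may differ from the mapping used in the figure:

```python
# Two bits per nucleotide: an 8-bit character becomes four bases.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bits_to_dna(bits):
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(strand):
    return "".join(BASE_TO_BITS[base] for base in strand)

# The chapter's example vector begins 01100011 01110010 01111001,
# which is "cry" in ASCII.
bits = "".join(format(ord(ch), "08b") for ch in "cry")
strand = bits_to_dna(bits)
```

Because the mapping is a bijection on bit pairs, `dna_to_bits` inverts `bits_to_dna` exactly, which is the lossless property the caption calls for.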
The Goldman encoding method

Dr. Nick Goldman, from the European Bioinformatics Institute, takes a different approach to encoding. It splits the input DNA nucleotides into overlapping segments, providing fourfold redundancy for each segment. Each window of four segments corresponds to a strand in the output encoding. Goldman used this encoding to successfully recover a 739-kilobyte message. This encoding is used as a baseline because it is the most successful published DNA storage technique. In addition, it offers a tunable level of redundancy: reducing the width of the segments repeats them more often in strands of the same length (for example, if the overlapping segments were half as long as in Fig. 6.4, they would be repeated in eight strands instead of four).
FIGURE 6.4 Goldman encoding protocol staggers segments from the input message “739 kB” to create redundancy (by repetition) for each segment.
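The staggering shown in Fig. 6.4 can be sketched as a sliding window of four segments, so that every interior segment lands in four different strands. A minimal sketch:

```python
def overlapping_segments(message, seg_len):
    """Split `message` into consecutive segments of `seg_len` characters,
    then emit sliding windows of four segments: each interior segment is
    thereby repeated in four different output strands."""
    segs = [message[i:i + seg_len] for i in range(0, len(message), seg_len)]
    return ["".join(segs[i:i + 4]) for i in range(len(segs) - 3)]

strands = overlapping_segments("739 kB is the payload", 3)
```

Halving `seg_len` doubles the number of strands each piece of the message appears in, which is the tunable redundancy described above.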
XOR encoding method

For higher reliability, researchers turn to a new algorithm based on XOR. XOR, meaning exclusive or, is used in mathematics, electrical engineering, and computer science. Given two inputs, the output is true only if the two inputs differ from each other, as displayed in Fig. 6.5. This means that 1 XOR 0 is 1, but 1 XOR 1 is 0. We can apply this logic to DNA as well: for example, if C is represented as 01 and W as 11, then C XOR W is 10. The XOR encoding (where A ⊕ B is true when either A or B, but not both, is true) provides its redundancy in a fashion similar to RAID 5 (redundant array of independent disks). RAID 5 uses data striping, in which the data are divided across the set of hard disks. Because striping spreads data across more physical drives, multiple disks can access the contents of a file, enabling writes and reads to complete more quickly. So, in the XOR encoding, any two of the three strands A, B, and A ⊕ B are enough to recover the third.
FIGURE 6.5 An example that provides redundancy by a simple exclusive-or operation at the strand level. We take the exclusive or A ⊕ B of the payloads A and B of two strands, which produces a new payload and so a new DNA strand. The address block of the new strand encodes the addresses of the input strands that were the inputs to the exclusive or; the high bit of the address is used to indicate whether a strand is an original payload or an exclusive-or strand.
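The recovery property (any two of A, B, and A ⊕ B reconstruct the third) is easy to demonstrate on byte payloads:

```python
def xor_payloads(a, b):
    # Bitwise exclusive or of two equal-length payloads.
    return bytes(x ^ y for x, y in zip(a, b))

a = b"PAYLOAD-A"
b = b"PAYLOAD-B"
parity = xor_payloads(a, b)            # stored as a third strand, A xor B
recovered_b = xor_payloads(a, parity)  # B rebuilt after losing strand B
```

The same call with B and the parity strand would recover A, since XOR is its own inverse.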
The reliability of the XOR encoding method is equivalent to Dr. Goldman's encoding: further experiments showed that objects were successfully recovered from both encodings in a wet lab trial. However, the theoretical density of the XOR encoding is much higher than Goldman's; simulation results showed the XOR encoding to be twice as dense.
The tunable (balanced) redundancy method

In this method, the level of data redundancy is a function of block granularity. For critical data, we can provide high redundancy by pairing (doubling blocks for security) critical blocks with many other blocks: if block A is critical, then both A ⊕ B and A ⊕ C can be stored. On the other hand, for blocks that are less critical, we can further reduce their redundancy: instead of including only two blocks in an exclusive or, we can include n, such that any n - 1 of the n blocks are sufficient to recover the last, at an average density overhead of 1/n. In addition to improving density, tunable redundancy has a significant effect on performance. Both DNA synthesis and sequencing are slower and more error-prone with larger datasets. It is often more practical to synthesize smaller DNA pools with more accurate technology, while larger pools are out of reach. Tunable redundancy allows the storage system to optimize the balance between reliability and efficiency.
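Generalizing the exclusive or from two blocks to n gives the RAID-5-style trade-off described above: one parity block protects n data blocks at an overhead of 1/n. A sketch:

```python
from functools import reduce

def parity(blocks):
    """Columnwise XOR over equal-length blocks. Storing this parity block
    lets any single lost block be rebuilt from the survivors."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

blocks = [b"BLK0", b"BLK1", b"BLK2", b"BLK3"]
p = parity(blocks)                   # overhead 1/n for n = 4 blocks
survivors = blocks[:2] + blocks[3:]  # suppose blocks[2] is lost
rebuilt = parity(survivors + [p])
```

Choosing a smaller n raises the overhead but tolerates losses within smaller groups, which is the tuning knob the text describes.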
Selecting the best encoding method

DNA synthesis and sequencing have many distinctive properties. Errors in synthesis and sequencing are not uniform: they vary according to location within a strand, and undesired reactions are sometimes caused by particular sequences of nucleotides. The optimum encoding is the one with the least complex encoding method. Synthesis errors are tied to the length of the strand being synthesized: nucleotides toward the end of the strand are more likely to be incorrect, but are easily ignored and truncated in order to meet the correct length of the product strand. An improved version of the XOR encoding would tolerate such position-dependent substitution errors by not aligning the strands directly. For example, rather than computing A ⊕ B, we might instead compute A ⊕ B', where B' is block B in reverse. Since blocks at the end of a strand are more error-prone, reversing one strand keeps the quality roughly constant along the combined strand. This allows each block to carry some redundancy information.
DNA storage with random access

To demonstrate how well random access works with DNA storage, we experimented with four image files using the Goldman and XOR encodings. The files varied in size from 5 kbits to 84 kbits. We synthesized these files and sequenced the resulting DNA to recover them. Further analysis will be conducted to reinforce how practical random access is with DNA. A simulator was used to perform more experiments exploring the design space of data encoding and durability.
The simulation method

A simulator of DNA synthesis and sequencing was used for further experiments. The simulator runs new encodings, as well as new configurations of the Goldman and XOR encodings, to answer two questions about DNA storage: first, how do different encodings trade storage density for reliability, and second, what is the effect of decay on the reliability of stored data?
A single strand of DNA, called a primer, is used as a starting point for DNA replication. DNA polymerase is the enzyme that synthesizes DNA molecules; in the polymerase chain reaction (PCR), it is used to amplify specific files and to incorporate sequence domains that are necessary for sequencing. Each primer has three sequence domains that attach to the original DNA strand. The first domain carries the sequences required by Illumina flow cells for next-generation sequencing. The second domain includes a region that helps the sequencing primer bind to the original DNA strand; this region allows the sequencing run to start from the primer region. These sequences were generated using Nucleic Acid Package (NUPACK), software for thermodynamic analysis of interacting DNA strands, in order to avoid the formation of secondary structure that could interfere with the sequencing reaction. The third domain consisted of a short strand used for detection on the Illumina sequencing platform, as shown in Fig. 6.6. Extracted from MERIT CyberSecurity archives. Thermo Fisher, a biotechnology company involved in the fields of scientific research, genetic analysis, and applied sciences, has a product known as "Platinum PCR Super Mix High Fidelity Master Mix," which is used for higher-fidelity PCR amplification of DNA templates. After the PCR amplification cycling process is done, the next step is next-generation sequencing. Finally, the product was sequenced using Illumina's MiSeq sequencing system.
FIGURE 6.6 Polymerase chain reaction (PCR) amplification is a method used to make many copies of a specific DNA segment. Using PCR, a DNA sequence is exponentially amplified to generate thousands to millions of copies of the particular segment.
Chapter 6 Getting DNA storage on board
Experiment description To demonstrate that DNA storage offers effective random access, we performed four encoding operations: three files encoded with the Goldman encoding and the last file encoded with the XOR encoding. We used as input four image files (x1.jpg, x2.jpg, x3.jpg, x4.jpg) and generated for each image file its corresponding DNA sequence file (xa.jpg, xb.jpg, xc.jpg, xd.jpg), using two different encodings: the Goldman encoding and our proposed XOR encoding. Combined, the operations generated 45,652 sequences of length 120 nucleotides, representing 151 kbits of data: three Goldman-encoded files and the fourth file using XOR (⊕) encoding. PCR amplification was used to select files for sequencing, and the product was sequenced on an Illumina MiSeq platform. The selected get operations total 16,994 sequences and 42 kbits. Sequencing produced 20.8 million reads of sequences in the pool. We inspected the results and observed that all reads belonged to the selected sequences, so random access was effective in amplifying only the target files.
Experiment results File recovery (back to binary) method We should admit that the language is foreign and a little elusive, but once learned it becomes easy to communicate with. The retrieval of the four files was successful. Three of the files were recovered without manual intervention. One file, x3.jpg, encoded with the Goldman encoder, incurred a one-byte error in the JPEG header, which we fixed by hand. The Goldman encoding provides no redundancy for the first and last bytes of a file, so this error was due to a random substitution in either sequencing or synthesis. This error can be eliminated by wrapping the redundant strands past the end of the file and back to the beginning. Fig. 6.7 shows the behavior of the sequencing depth, which resembles a hyperexponential distribution. As sequencing technology improves, the curve will take a different shape.
FIGURE 6.7 This chart shows the distribution of sequencing depths over the 16,994 selected sequences. The sequencing depth of a strand is the number of times it was perfectly sequenced. We can see that as the number of input blocks increases, sequence coverage degrades.
What is sequencing depth (number of times sequenced)? Plainly speaking, depth refers to how many times a DNA input base is sequenced in order to generate a clean encoded output (sequence assembly). Of the 20.8 million reads from the sequencing run, 8.6 million were error-free reads of a strand in the desired pool. The distribution of reads by sequence is heavily skewed, with a mean depth of 506 and a median depth of 128. These results suggest that encodings need to be robust not only to missing sequences (which get very few reads) but also to heavily amplified incorrect sequences.
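The skew can be illustrated with a quick simulation (synthetic numbers, not the experiment's data): drawing per-strand read counts from a heavy-tailed distribution reproduces the mean-far-above-median behavior reported here.

```python
import random
import statistics

random.seed(42)

# Hypothetical per-strand sequencing depths drawn from a heavy-tailed
# Pareto distribution: most strands get few reads, a few get very many,
# mimicking PCR amplification bias.
depths = [int(random.paretovariate(1.2) * 10) + 1 for _ in range(10_000)]

mean_depth = statistics.mean(depths)
median_depth = statistics.median(depths)
# As in the experiment (mean 506 vs. median 128), the heavy tail drags
# the mean well above the median.
```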
Reduced sequencing depth We define the term coverage with an example: 30× coverage means that each input DNA base has been read 30 times. The purpose of this repetitive sequencing is to generate a 100% identical copy of the input. Sequencing technology can reduce sequencing depth in exchange for faster, higher-throughput results. To determine whether the Goldman and XOR (exclusive-OR) encodings remain effective at reduced sequencing depth, random samples of the 20.8 million reads were selected, and the picture (sydney.jpg) was decoded again using both the Goldman and XOR encodings. Fig. 6.8 shows that both encodings respond similarly to reduced sequencing depth. The x-axis shows the fraction of the reads used for decoding, whereas the y-axis shows how accurate the decoding was. The result was that even a small fraction of the reads gave high accuracy, and we found the XOR encoding to be slightly better than the Goldman encoding.
FIGURE 6.8 The x-axis plots the fraction of the 20.8 million reads used, and the y-axis plots the accuracy of the decoded file. Both encodings start near 25% accuracy, which increases as the fraction of reads increases. The accuracy of the two encodings is similar; however, the XOR encoding has higher density than the Goldman encoding.
FIGURE 6.9 We use the exclusive-or A ⊕ B of the payloads A and B of two strands, which produces a new payload and so a new DNA strand. The address block of the new strand encodes the addresses of the input strands to the exclusive-or; the high bit of the address indicates whether a strand is an original payload or an exclusive-or strand: https://www.cs.utexas.edu/~bornholt/dnastorage-asplos16/.
Naive encoding method The naive method is, simply put, to check for the input block inside a stream of output blocks, proceeding serially and recording every match. As a simple example, we check whether there is a copy of the needle starting at the first character of the haystack; if not, we look for a copy of the needle starting at the second character; if not, we look starting at the third character, and so forth. In the normal case, we must look at only one or two characters of each wrong position to see that it is wrong. Fig. 6.9 shows that the XOR encoding is a superset of (it includes) the naive encoding. For example, if we ignore the XOR product strands, we are left with only the naively encoded strands; with many strands missing, and even after enhancing the decoder for better sequencing, we would not be able to recover a valid file. The XOR (exclusive-OR) encoding corrected all these errors at a lower density overhead than the Goldman encoding. These results suggest that even at very high sequencing depths, a naive encoding is not enough for DNA storage. The rule that prevails is that encodings must provide their own resilience to errors.
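The needle-in-haystack procedure described above is classic naive string matching; a small sketch (illustrative, not taken from the source):

```python
def naive_find(needle: str, haystack: str) -> list:
    """Check every alignment of needle against haystack, left to right,
    recording every match; a mismatch usually shows up within the first
    character or two of a wrong position."""
    matches = []
    for start in range(len(haystack) - len(needle) + 1):
        for offset in range(len(needle)):
            if haystack[start + offset] != needle[offset]:
                break  # wrong position: stop comparing early
        else:
            matches.append(start)  # every character matched here
    return matches
```

For example, searching for "AC" in the strand "ACGACA" finds matches at positions 0 and 3.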
Comparison of reliability versus density The purpose of this test is to compare the Goldman and XOR methods on two parameters: reliability and density. Encoding variables were modified to provide either higher density or higher reliability. To examine this trade-off between encoding methods, two sample picture files were selected for encoding with the different methods. The comparison was performed by analyzing a "test block" from the input strands. Fig. 6.10 shows how the three methods (Goldman, XOR, naive) performed. Density is defined as Density = bits encoded per strand / number of nucleotides in the strand
FIGURE 6.10 The chart shows the reliability of recovery of data for the three encoding methods as a function of strand density.
We performed three different cases for the three encodings: naive with no redundancy, Goldman, and XOR. The results are explained as follows:
Test result 1: percent of file recovery as a function of strand density Simple (naive) encoding has the lowest reliability because there is no redundancy to check against. Goldman is more resilient than XOR because it does not combine the replicated bits; it simply replicates them. As sequencing depths increase, XOR becomes as reliable as Goldman because the probability of having no copies at all of the original data drops significantly. Fig. 6.10 reveals interesting results.
Test result 2: density as a function of strand length in bits Fig. 6.11 shows simulated results for how much binary data can be encoded in DNA versus DNA strand length. Normally we work with a DNA strand length of about 200 nucleotides. If we were to translate binary data directly to DNA by mapping 00 = A, 01 = C, and so on, the density would be 2 bits per nucleotide. We found that the XOR encoding can achieve higher density than the Goldman encoding, which means less DNA is required to encode the same amount of binary data. Researchers concluded that this makes XOR encoding a more attractive option than Goldman encoding.
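The direct 2-bits-per-base translation mentioned above can be sketched as follows (a minimal illustration, assuming the mapping 00 = A, 01 = C, 10 = G, 11 = T):

```python
TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
TO_BITS = {base: bits for bits, base in TO_BASE.items()}

def encode(bits: str) -> str:
    """Translate a binary string directly to DNA, 2 bits per nucleotide."""
    return "".join(TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> str:
    """Invert the mapping: each nucleotide yields its 2-bit pair."""
    return "".join(TO_BITS[base] for base in dna)
```

For example, encode("10011100") yields "GCTA", so an 8-bit byte fits in 4 nucleotides; real encodings give up some of this ideal density to addressing and redundancy.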
FIGURE 6.11 The chart shows the three curves as a function of their nucleotide (block) length.
Test result 3: desired reliability as a function of time span and number of copies needed to achieve the desired reliability Finally, we used the simulator to evaluate the durability of DNA storage over time. The durability of electronic storage is, in the best case, in the tens of years. In contrast, the half-life of single-stranded DNA is far longer. To demonstrate the durability of DNA storage, we simulated decay at room temperature; storage at lower temperatures significantly increases the half-life of the stored DNA. Fig. 6.12 shows the interesting result that very few copies of a DNA strand offer high reliability far beyond the half-lives of any electronic device. The corollary is that high reliability is largely independent of the time interval.
FIGURE 6.12 The chart shows desired time span on the x-axis and the number of copies of each strand required to achieve a desired reliability after the time span on the y-axis. Different curves correspond to different desired reliabilities. For example, the 99.99% reliability line says that to have a 99.99% chance of recovering an encoded value after 100 years, we need to store only 10 copies of the strands for that value. The results show that even very few copies are enough to provide high reliability long beyond the half-lives of existing electronic media. In summary, high reliability for long time intervals does not have much impact on DNA storage density. Source: https://www.cs.utexas.edu.
DNA the Rosetta stone CRISPR is a kind of genetic memory, a system for storing information, and that information does not have to be the DNA of viruses. Scientists can now encode any digital file in the form of DNA by converting the 1s and 0s of binary code into the As, Cs, Gs, and Ts of the double helix. This is a way to show off CRISPR's power to turn living cells into digital data warehouses. For Jeff Nivala, a postdoctoral fellow and geneticist at Harvard Medical School, the goal is not to preserve visual messages for people in the far-off future; it is to turn human cells such as neurons into biological recording devices. Human DNA has approximately 3 billion base pairs, according to the National Human Genome Research Institute. That means 4^3,000,000,000 possible base sequences. Most humans have between 20,000 and 25,000 genes; let us say the average is about 22,500, giving 2^22,500 more choices. The length of DNA varies for different species; humans, with about 3 billion base pairs, have neither the largest nor the smallest genome. Therefore, the human DNA genome has 4^3,000,000,000 = 2^6,000,000,000 choices to encode, or 6 billion bits of information. The
epigenome (which tells DNA what to do) encodes at least 2^22,500 choices, or 22,500 bits. The total information is then 6,000,000,000 + 22,500 = 6,000,022,500 bits, or approximately 6 Gb (gigabits). We usually discuss computer storage in bytes rather than bits; using 7-bit ASCII code, 6 Gb would amount to 6/7 = 0.857 GB (gigabytes), or 857 MB (megabytes).

Suppose we have an 8-bit binary sequence, 10011100, to encode into a DNA sequence under the rule A = 00, T = 11, G = 10, C = 01. The result is GCTA; that is, an 8-bit binary sequence needs only a 4-base DNA sequence.

DNA synthesis is the process whereby deoxyribonucleotides (carrying the bases adenine, thymine, cytosine, and guanine) are linked together to form DNA. Twenty amino acids are encoded by combinations of the 4 nucleotides. If a codon were two nucleotides, the set of all combinations could encode only 4 × 4 = 16 amino acids. With three nucleotides, the set of all combinations can encode 4 × 4 × 4 = 64 codons (i.e., 64 different combinations of four nucleotides taken three at a time), which is more than enough for 20 amino acids.

DNA is nature's hard drive, capable of storing, replicating, and transmitting massive amounts of information. Researchers in New York found a way to use DNA like an actual computer hard drive, successfully storing, replicating, and retrieving several digital files. A pair of scientists from Columbia University and the New York Genome Center selected five files, including a computer operating system and a computer virus, and compressed them into a master file, which they transcribed into short strings of binary code, combinations of 1s and 0s. The bioinformatician Jeena Lee presented a paper titled "A DNA-Based Archival Storage System" and concluded in her report that "The researchers also calculated the number of data copies required for a given length of storage time and the level of reliability wanted.
DNA has a half-life of 500 years, which roughly means that if we have two copies of DNA, one copy will degrade over the course of 500 years. They found that if we want 99.99% reliability and want to store the data for 100 years, then we need only 10 copies of the DNA-encoded data. Given the high density of DNA (1 exabyte/1 mm³), storing ten copies will take up little space".
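As a back-of-the-envelope check (a deliberately simple model assumed here, with independent copies each decaying under a 500-year half-life; the 10-copy figure above comes from a fuller simulation), the copy count for a target reliability can be estimated as:

```python
import math

def copies_needed(reliability, years, half_life_years=500.0):
    """Smallest k such that at least one of k independent copies survives
    with the desired probability, given exponential decay with the stated
    half-life."""
    p_survive = 2.0 ** (-years / half_life_years)   # one copy still intact
    p_fail_all = 1.0 - reliability                  # allowed loss probability
    return math.ceil(math.log(p_fail_all) / math.log(1.0 - p_survive))
```

Under this toy model, 99.99% reliability over 100 years needs only a handful of copies, consistent in spirit with the 10 copies quoted above.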
Silicon is getting scarce What is silicon? Its chemical symbol in the periodic table is Si. Being a tetravalent (a measure of its combining power with other atoms) metalloid (a type of chemical element that has properties in between, or that are a mixture of, those of metals and nonmetals), silicon is less reactive than its chemical analog carbon. It is the second most abundant element in the Earth’s crust, making up 25.7% of it by weight. Silicon is a very useful element that is vital to many human industries. It is used frequently in manufacturing computer chips and related hardware. Because silicon is an important element in semiconductor and high-tech devices, the high-tech region of Silicon Valley, California, is named after this element. For half a century, silicon has been the semiconductor industry’s lifeblood. The processor in your computer and the memory in your phone are embedded in silicon chips. But big problems are emerging. While silicon is one of the world’s most common elements, most of it is found in impure forms. The supply of silicon pristine enough to make into chips is dwindling. This threatens the future of digital memory, at a time when memory usage is growing fast, and as usage grows, so does demand for electricity, especially at power-guzzling datacenters. One interesting revelation that sounds just short of sci-fi quality about silicon is “In 2006, researchers announced they had created a computer chip that melded silicon components with brain cells. Electrical signals from the brain cells could be transmitted to the electronic silicon components of the chip, and vice versa. The hope is to eventually create electronic devices to treat neurological disorders.”
Artificial gene synthesis Artificial gene synthesis, sometimes known as DNA printing, is a method in synthetic biology used to create artificial genes in the laboratory. With it, it is possible to make a completely synthetic double-stranded DNA molecule with no apparent limits on either nucleotide sequence or size. In archiving digital data in DNA, making artificial DNA comes in handy. The term "oligonucleotide," or "oligo," usually refers to a synthetic laboratory-made DNA or RNA strand. Oligonucleotides are used in biochemistry, biology, molecular diagnostics, genomics, and other molecular biology experiments. Fig. 6.13 shows an interesting conversion table that is used in DNA storage and in encryption of stored data.
FIGURE 6.13 A conversion table showing five standard numbering systems from 0 to 64. The quaternary, or base-4, numeral system is a number system that utilizes only the four (4) digits: 0, 1, 2, and 3, to represent numbers.
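The base-4 column of such a table links directly to DNA: each quaternary digit can stand for one nucleotide. A small sketch (the digit-to-base mapping 0→A, 1→C, 2→G, 3→T is one common convention, assumed here, not taken from the figure):

```python
BASE4_TO_NT = "ACGT"  # quaternary digit -> nucleotide (assumed convention)

def to_base4(n: int) -> str:
    """Render a nonnegative decimal integer as a quaternary digit string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 4)
        digits.append(str(r))
    return "".join(reversed(digits))

def base4_to_dna(q: str) -> str:
    """Map each quaternary digit to its nucleotide."""
    return "".join(BASE4_TO_NT[int(d)] for d in q)
```

For instance, decimal 27 is quaternary 123, i.e., the strand CGT.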
Amazon's flying warehouses According to a recent press report, Amazon has filed for a patent on "flying warehouses", essentially airborne fulfillment centers that would be stocked with a certain amount of inventory and positioned near a location where Amazon predicts demand for certain items will soon spike. Good idea, right? Flying warehouses make sense for Amazon, as the company appears to be taking drone delivery seriously. Now imagine that instead of the datacenter hosting the virtual machine(s) being in another city, it is actually flying right above the city where the car is. The most obvious challenge is how effectively the data are transmitted and at what rate. The simplest and probably most likely way for flying datacenters to transmit data will be by beaming electromagnetic waves, as satellites and cell phones do.
Church's DNA storage Dr. George Church, the Harvard geneticist who pioneered DNA encoding of digital data, has stored 70 billion copies of his book, Regenesis, in a drop of synthetic DNA smaller than the period at the end of this sentence. Under ideal conditions, says Church, those books will last 700,000 years; to give a sense of that time scale, the first printed book, the Gutenberg Bible, was produced just 560 years ago.
DNA computing: the tables turned DNA computing performs computations using biological molecules rather than traditional silicon chips. It is a fascinating idea to store binary 0s and 1s in the four-character genetic alphabet (A [adenine], G [guanine], C [cytosine], and T [thymine]). The idea dates to 1959, when American physicist Richard Feynman presented his ideas on nanotechnology; however, DNA computing was not physically realized until 1994, when American computer scientist Leonard Adleman showed how molecules could be used to solve a computational problem. DNA computing is a technological tsunami that could sweep away, within a decade, present computer technology and bring a new dimension to informatics hardware and software. Fig. 6.14 shows the infrastructure of DNA and a portfolio of the most innovative applications that will mature in the next few years. DNA has become a platform for many 21st-century technologies with great societal and business impacts. A "molecular program" was given for breaking the US Government's Data Encryption Standard (DES). DES encrypts 64-bit messages and uses a 56-bit key. Breaking DES means that, given one (plaintext, ciphertext) pair, we can find a key that maps the plaintext to the ciphertext. A conventional attack on DES would need to perform an exhaustive search through the 2^56 DES keys, which, at a rate of 100,000 operations per second, would take about 10,000 years. In contrast, it was estimated that DES could be broken using molecular computation in about 4 months of laboratory work. These results show that molecular computation has the potential to outperform existing computers, in part because the operations molecular biology currently provides can be used to organize massively parallel searches.
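The 10,000-year figure checks out as an order-of-magnitude estimate if the attacker must test, on average, half the keyspace (a standard assumption for exhaustive search, assumed here):

```python
# Order-of-magnitude check of the brute-force DES key search quoted above.
KEYSPACE = 2 ** 56                  # all 56-bit DES keys
RATE = 100_000                      # key trials per second, as quoted
SECONDS_PER_YEAR = 365 * 24 * 3600

expected_trials = KEYSPACE / 2      # on average, half the keys are tried
expected_years = expected_trials / RATE / SECONDS_PER_YEAR
# Roughly 1.1e4 years, i.e., on the order of the 10,000 years cited.
```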
It is estimated that DNA computing could yield tremendous advantages in speed, energy efficiency, and economical information storage. For example, in Adleman's model, the number of operations per second could be up to
FIGURE 6.14 The 3D view of the DNA computing ecosystem. Our digital universe will one day connect with DNA storage and archiving. DNA computing will proliferate, and there will be many low-hanging applications to solve complex engineering, medical, and business problems. In the next decade, all the blocks of DNA applications will multiply and diversify, and computer science and biomedical engineering will happily merge.
approximately 1.2 × 10^18. This is approximately 1,200,000 times faster than the fastest supercomputer. While existing supercomputers execute 10^9 operations per joule, the energy efficiency of a DNA computer could be 2 × 10^19 operations per joule; that is, a DNA computer could be about 10^10 times more energy efficient. Finally, storing information in molecules of DNA could allow an information density of approximately 1 bit per cubic nanometer, while existing storage media store information at a density of approximately 1 bit per 10^12 nm³. By these estimates, a single DNA memory could hold more words than all the computer memories ever made.
DNA is the new supercomputer Scientists have used DNA molecules to create a new, superfast computer that is capable of "growing as it computes." The Journal of the Royal Society Interface (https://www.mub.eps.manchester.ac.uk/inabstract/computing-exponentially-faster-implementing-a-non-deterministic-universal-turing-machine-using-dna/) reported that research has shown the feasibility of a nondeterministic universal Turing machine, or NUTM; until now, such a computing entity existed only in theory. Dr. Ross D. King, a professor of computer science at the University of Manchester, announced the advantage of the DNA supercomputer by saying that the Hamiltonian path problem can be solved on a DNA supercomputer 1000-fold faster than on electronic computers, simply because it can replicate itself and follow both paths at the same time, thus finding the answer faster. The Hamiltonian path problem is a special case of the traveling salesman problem, in which one is given a list of cities and the distances between each pair of cities, and an algorithm must find the shortest possible route that visits each city and returns to the origin city. Unlike electronic computers, which rely on a fixed number of silicon chips, the new NUTM-like device uses DNA, which can replicate. No ordered operations or communication is necessary in the new computer; the DNA is edited, or preprogrammed, to replicate and carry out an exponential number of computational paths.
Scientists have learned that sequences of molecules called nucleotides in DNA can store other information besides a living organism’s genetic heritage. Will Hughes, Director of the Micron School of Materials Science and Engineering, said in an interview “The calculations we’ve done are that DNA is 1000 times denser than flash memory. It can store information for thousands to millions of years, depending on how it’s stored. The energy of operation is 100 million times less than all electronic and magnetic memory today. All the information that we have can fit into a 10-by-10-by-10-centimeter box of DNA.” This is about the same size as a 12-ounce can of peanuts.
Appendix 6.A Glossary of DNA encoding (from MERIT CyberSecurity library) Allele: One of two or more forms of a particular gene at a specific position on a chromosome. Allelic heterogeneity: Different mutations in the same gene can lead to the same disease. For example, Duchenne muscular dystrophy is caused by large deletions in the dystrophin gene 70% of the time, but other mutations in the dystrophin gene can lead to the same disease presentation. Alternative splicing: The process where the initial strand of RNA copied directly from DNA is cut up (spliced) and pieced together into different messages (mRNAs) that can lead to different proteins from the same region of DNA. Thus, one gene can be the source of several proteins that differ in function or in time and place of action. Amino acid: The building blocks for proteins. Composed primarily of carbon, hydrogen, oxygen, and nitrogen, these organic molecules all contain a basic amino group (NH2) and an acidic carboxyl group (COOH), and hence the truly logical moniker, amino acid. To keep on living and loving, we humans need 20 of them. Each amino acid contains a unique side chain that varies from the single hydrogen atom of glycine to the sulfur-containing side chains of cysteine and methionine. Our bodies make 12, and we get the other 8 (the 8 essential amino acids) from our food. Amniocentesis: A prenatal procedure that can examine the DNA of a fetus. It involves removing a small amount of the fluid from around the developing baby (amniotic fluid) with a needle inserted into the uterus. The amniotic fluid contains some of the baby's cells, which can undergo various lab tests. It can diagnose genetic disorders based on chromosomes, such as Down syndrome, or conditions with smaller gene mutations, like cystic fibrosis. Amniotic fluid: The fluid surrounding the fetus in the uterus. It is mostly water but also has cells sloughed off from the fetus, secretions from the placenta, and fetal urine.
These fetal cells can be examined for chromosomal and genetic disorders. Anatomy: The study of the structure of plants and animals, as determined by dissection. Apoptosis: The body's method for disposing of unwanted cells. Otherwise known as programmed cell death, apoptosis is a deliberate "suicide" where a cell dies in an organized way. This contrasts with necrosis, the "messy and violent" form of cell death. Apoptosis is important when there are a lot of changes, such as during fetal development. Assay: A laboratory method used to quantitatively determine the amount of a given substance in a particular sample. Autoimmune disease: Disease in which the immune system confuses the patient's own cells for foreign cells and attacks them.
Autosome: The chromosomes other than the sex chromosomes. Human beings have 22 pairs of autosomes (chromosomes 1–22) and two sex chromosomes: two Xs if you are female, or an X and a Y if you are male. Backcross: A process used in research to create a plant or animal that is genetically very similar to one of the parents. Offspring are mated with one of their parents, or with a plant or animal genetically similar to the parents, to create offspring with a similar genetic background. Bacterium: A member of a large group of unicellular microorganisms that have cell walls but lack organelles and an organized nucleus. Base pair: A base is the variable component of a nucleic acid. DNA contains the four bases thymine (T), cytosine (C), adenine (A), and guanine (G). These bases form base pairs along the length of the DNA helix: A pairs with T, while G pairs with C. The order of these bases (ATCG) composes the DNA sequence. Bioinformatics: The science of managing and analyzing biological information, usually with advanced computing techniques. BLAST: BLAST stands for basic local alignment search tool and is a computer program that compares DNA or protein sequences. When you enter your favorite sequence, BLAST searches a database of biological sequences for similar sequences. This allows researchers to identify similar genes in the same or other organisms. BRCA1/BRCA2: Breast cancer 1 and 2 genes are examples of tumor suppressors that normally help to restrain cell growth. When these genes are mutated and inactivated, cells can grow too much and cancer develops. Someone who has inherited a mutation in BRCA1/BRCA2 has an increased risk of breast (for both women and men), ovarian, or prostate cancer. Cancer: Malignant, ill-regulated proliferation of cells causing either a solid tumor or other abnormal conditions, usually fatal if untreated.
Cancer cells are abnormal in many ways, particularly in their ability to multiply indefinitely, to invade underlying tissue, and to migrate to other sites in the body and multiply there (see Metastasis). Candidate gene: A gene suspected to cause or contribute to a disease. The gene's function and place of action should suggest a role in the disease in question. For example, a gene coding for a protein that works in the brain and regulates neurotransmitters may be a good candidate for causing depression. Carbohydrate: A sugar molecule. Carbohydrates can be small and simple (for example, glucose), or they can be large and complex (for example, polysaccharides, such as starch). Carrier (of a genetic disease): Someone who has a disease-causing mutation in their DNA but does not show any symptoms. One way this happens is in recessive disorders, when someone has one healthy and one faulty version of the disease-causing gene. Cell: The smallest unit of life that can exist independently. All organisms are made up of one or more cells. Centromere: A region on a chromosome where the kinetochore assembles and at which the chromosome becomes attached to the microtubules of the spindle during mitosis or meiosis. Chimera: An organism developing from an embryo composed of cells from two different individuals and therefore composed of cells of two different genotypes. Compare to Mosaic. Chorionic villus sampling (CVS): A prenatal procedure that cultures tissue from the placenta (chorionic villi) so the DNA of fetal cells can be examined. CVS is commonly used to diagnose chromosome conditions like Down syndrome.
Chromatin: A complex of macromolecules found in cells, consisting of DNA, protein, and RNA. The primary functions of chromatin are (1) to package DNA into a smaller volume to fit in the cell, (2) to reinforce the DNA macromolecule to allow mitosis, (3) to prevent DNA damage, and (4) to control gene expression and DNA replication. Chromosome: A long, twisted, and folded-up piece of DNA. Each species has a characteristic number of chromosomes. Humans, for example, have 46 chromosomes, 23 contributed by each parent. Cleft lip: Results from both genes and environment and causes a visible split of the upper lip. Clone: An organism, either single-celled or multicellular, that has exactly the same DNA as another organism. For example, plant cuttings are clones. Codominant: Two different versions of a gene contribute to the final trait so that neither version is masked by the other. For example, if a woman with blood type A passed on the A allele and a man with blood type B passed on the B allele, their child would be AB. Codon: Three adjacent bases (letters) in mRNA that "code" for a particular amino acid in protein translation. There are 64 possible three-letter combinations of the bases but only 20 amino acids, so several of the codons code for the same amino acid. Complementary: A property of nucleic acids, whereby adenine (A) always pairs with thymine (T) while cytosine (C) always pairs with guanine (G). Two strands of DNA that pair perfectly are called complementary. Congenital: Referring to a trait in an organism that is present at its birth. A congenital trait can be due to genetic factors, environmental ones, or a mixture of both. Conserved genes: Genes similar across species, that is, conserved through evolution, so they are likely well used and necessary for survival. For example, all organisms have very similar genes handling DNA through cell division. Covalent bond: A chemical bond formed by the sharing of electrons between two atoms.
CRISPR-Cas9: CRISPR stands for clustered regularly interspaced short palindromic repeat and is a customizable tool that lets scientists cut and insert small pieces of DNA at precise areas along a DNA strand. The tool is composed of two basic parts: the Cas9 protein, which acts like the wrench, and the specific RNA guides, CRISPRs, which act as the set of different socket heads. These guides direct the Cas9 protein to the correct gene, or area on the DNA strand, that controls a particular trait. Crossing over: A process that occurs during meiosis. Two members of a chromosome pair twist around one another and exchange genetic information. Genetic material is “shuffled” so that each gamete contains a unique combination of its parent’s genes. It is how, for example, our grandparents’ traits are shuffled; we can have our maternal grandmother’s eyes but our grandfather’s hair color. Cystic fibrosis (CF): An inherited condition in humans caused by a recessive genetic defect. CF is characterized by the buildup of a thick, sticky mucus that can damage many of the body’s organs. The disorder’s most common signs and symptoms include progressive damage to the respiratory system and chronic digestive problems. Diploid: The characteristic of an organism or cell having two complete sets of chromosomes in each cell. For example, humans have one set of chromosomes from dad and the other from mom; hence, they also have two alleles of each gene. Compare to Haploid view. DNA fingerprint view: A technique used for identification, for example, in paternity tests or crime scene investigations. Relatively large amounts of DNA are isolated and cut by restriction enzymes. The lengths of the resulting DNA fragments are characteristic of each individual (see Single nucleotide polymorphism). Note: this technique has been essentially replaced by other types of DNA profiling.
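The Complementary and Codon definitions above are easy to verify computationally. A minimal Python sketch (the example sequence is purely illustrative):

```python
from itertools import product

# Base-pairing rule from the glossary: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

print(reverse_complement("ATGC"))  # GCAT

# Codon arithmetic: 4 bases taken 3 at a time gives 4**3 = 64 codons,
# which is why several codons must map to the same amino acid (only 20 exist).
codons = ["".join(c) for c in product("ACGT", repeat=3)]
print(len(codons))  # 64
```

Note that a strand is always its own double reverse complement, which is a handy sanity check when writing encoding pipelines over DNA alphabets.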
Appendix 6.A Glossary of DNA encoding (from merit CyberSecurity library)
DNA methylation: A biochemical process involving the addition of a methyl group to the cytosine of DNA nucleotides. DNA methylation stably alters the expression of genes in cells.
DNA polymerase: An enzyme that can make new DNA strands from an old strand/template. There are several DNA polymerases in our cells; although they have slightly different functions, they all are involved in DNA replication.
DNA profiling: A technique used to identify a person based on his or her DNA. Fragments of the DNA are amplified using PCR, and the lengths of the resulting DNA pieces are characteristic of each individual. Because of the PCR step, only very small amounts of DNA are needed for DNA profiling. Compare to DNA fingerprint.
DNA replication: The process of producing two identical replicas from one original DNA molecule. This biological process occurs in all living organisms and is the basis for biological inheritance.
DNA: DNA stands for deoxyribonucleic acid. It is the molecule that contains the genetic instructions to construct and maintain a living organism.
Dominant: Refers to the allele that is expressed when two different alleles are found together. For example, of the wet and dry alleles for ear wax consistency, wet is the dominant one; so when someone has one wet and one dry allele, he or she has wet ear wax. Compare to Recessive.
Down syndrome: Also called trisomy 21. A chromosome defect characterized by the presence of an extra copy of chromosome 21 in a patient's cells. Down syndrome causes delayed mental and physical development and is associated with characteristic facial features.
Drug target: A key biological molecule that is specific to a disease. A successful drug will block or inhibit this molecule from promoting the disease process.
Embryo: A multicellular animal or plant before it is fully formed and capable of independent life. It develops into a free-living miniature adult or larva, in animals, or germinates into a seedling, in plants.
Enzyme: A biological catalyst: a protein that can speed up chemical reactions without getting chemically changed itself. Human saliva contains an enzyme called amylase that speeds up the chemical reaction of converting starches in our food to sugars.
Epidemiology: The study of the incidence and distribution of diseases and of their control and prevention.
Epigenetics: Processes by which modifications in gene function that can be inherited by a cell's progeny occur without a change in the DNA nucleotide sequence. Such modifications include DNA methylation, heterochromatin formation, genomic imprinting, and X chromosome inactivation.
Epigenomics: The study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome.
Epistasis: The phenomenon that genes can interact with each other. A gene is epistatic when it changes the effect another gene conveys on the organism.
Eukaryote: An organism that consists of a cell or cells that have membrane-bound organelles. One of these organelles is the nucleus, in which DNA is stored. Compare to Prokaryote.
Exome sequencing: Exome sequencing (also known as whole exome sequencing or WES) is a technique for sequencing all the protein-coding genes in a genome (known as the exome). The exome comprises about 1% of the genome and is, so far, the component most likely to include interpretable mutations that result in clinical phenotypes.
Exon: The region of RNA that is translated into protein. In eukaryotes, the primary RNA usually contains several exons that are separated by introns. Before the RNA is translated, the introns are removed and only exons remain. Compare to Intron.
Expressivity (variable expressivity): The process of the same mutation leading to different disease manifestations. For example, the same mutations in a breast cancer gene (BRCA1) can lead to breast or ovarian cancer.
Fetal ultrasound markers: A variation of normal fetal anatomy detected on ultrasound. Soft markers are not abnormalities and are seen in healthy, normal pregnancies, but when a soft marker is seen, it increases the risk that there is a chromosome problem in the baby. For example, extra fluid behind a baby's neck (called "nuchal translucency") can be seen in healthy, unaffected babies but is more common in babies with a chromosome problem, such as Down syndrome.
Gamete: A haploid reproductive cell produced by meiosis that then fuses with another of the opposite sex or mating type to produce a diploid zygote.
Gel electrophoresis: A technique to separate biological molecules such as DNA or protein based on their size. An electrical gradient is applied to a semisolid matrix (the gel) and molecules begin to migrate based on their electrical charge. Because of their smaller size, short pieces of DNA or protein travel through the gel faster.
Gene editing: A type of genetic engineering in which DNA is inserted, replaced, or removed from a genome using artificially engineered nucleases, or "molecular scissors."
Gene expression: The "switching on" and transcription of a gene into RNA. Sometimes this RNA can be functional on its own, but usually it is converted into a protein. Either way, the gene's information is out there to impact the individual's phenotype.
Gene family: A set of genes similar in DNA sequence. Their similarity is presumably due to their evolution from a single ancestral gene that duplicated. Although these gene products gained or lost functions over time and became slightly different from their common ancestor, remnants of the ancestral gene are still present in all family members.
Gene knockout: The intentional removal or disruption of function of a gene from an organism's DNA. For example, a P53 knockout mouse lacks the gene encoding for the P53 protein. Knockout mice are often used to study the function of the knocked-out gene by noting what differences are observed in the mutated mouse relative to a normal mouse.
Gene therapy: The process of replacing "faulty" genes involved in inherited diseases with "corrected" versions. For example, Glybera is a gene therapy designed to restore the enzymatic activity of lipoprotein lipase, a protein required to enable the metabolism of fat from fat-carrying particles made in the intestine following a meal.
Gene: The fundamental unit of inheritance. Genes encode messages for the synthesis of proteins and functional RNAs. Genes help determine an organism's appearance and metabolism, and they may interact with the environment to influence its behavior.
Genetic counselor: A health professional who provides information and support to families at risk for or affected by a genetic condition. Genetic counselors translate complex genetic information into everyday language to help people make informed decisions about the issue at hand. They strive to discuss information and present options in a nonbiased way to encourage patients to make decisions fitting with their own personal values and beliefs.
Genetic engineering: The direct manipulation of an organism's DNA, in contrast to traditional breeding, where the DNA is manipulated indirectly. Many different techniques are used in genetic engineering, including gene knockouts and transgenic organisms.
Genetic marker: A segment of DNA that can be tracked from one generation to the next. Markers can be entire genes or a single letter of code (see Polymorphism).
Genetic testing: Using tests to diagnose or determine the predisposition to a genetic disease. The type of test can vary and includes tests directly on DNA, as well as biochemical tests that analyze proteins or metabolites linked to genetic diseases. The tests can also be used to prove paternity.
Genetically modified organism (GMO): An organism that has had its DNA altered by genetic engineering. When scientists generate GMOs, they combine existing pieces of DNA in new ways to give an organism new characteristics.
Genome: An organism's complete set of DNA; basically, a blueprint for the organism's structure and function.
Genomics: The science that aims to decipher and understand the entire genetic information of an organism (i.e., plants, animals, humans, viruses, and microorganisms) encoded in DNA and corresponding complements such as RNA, proteins, and metabolites. Broadly speaking, this definition includes related disciplines such as bioinformatics, epigenomics, metabolomics, nutrigenomics, pharmacogenomics, proteomics, and transcriptomics.
Genomic imprinting: When the expression of the maternally derived or the paternally derived allele of a gene is suppressed in the embryo. Gene inactivation is correlated with increased DNA methylation of the gene.
Genotype: The specific set of alleles contained in the DNA of an organism. The genotype, as well as environmental and epigenetic factors, determines the final traits. Compare to Phenotype.
Germline: The group or line of cells that gives rise to reproductive cells (sperm or eggs). Mutations in the germline are passed on to future generations. Cells that are not part of the germline are called somatic cells.
Haploid: The characteristic of an organism or cell having only one set of chromosomes and therefore only one allele of each gene. Human sperm and egg cells, for example, are haploid. When they combine during fertilization, they form a diploid (containing two sets of chromosomes) embryo. Compare to Diploid.
Heritability: An expression of how much of the variation in a trait in a population is due to genes as compared to how much is due to environment. If a trait has a high heritability, it generally means that genetic factors strongly influence the amount of variation.
Heterochromatin: Densely coiled chromatin that appears in or along chromosomes.
Heterozygous: An individual having two different alleles of the same gene. Cystic fibrosis (CF) is an example of an inherited condition in humans caused by a recessive genetic defect. The parents of an individual with CF carry one disease allele (recessive) and one functional allele (dominant); hence, the parents are heterozygous for the trait manifesting in CF. Compare to Homozygous.
Histone: A group of structural proteins that act as spools around which DNA winds.
Histopathology: The microscopic examination of changes in tissue anatomy caused by disease processes.
Homologous: From "homology": the characteristic of genes or organisms being similar due to a shared ancestry. Homology can be subdivided into orthology and paralogy.
Homozygous: An individual having two identical alleles of a particular gene. Albinism is a common example of a recessive trait in which there is an absence of pigmentation in animals that are normally pigmented. Recessive traits only appear if an individual is homozygous for the corresponding gene. Compare to Heterozygous.
Hot spot: A sequence with a very high frequency of recombination (recombination hot spot) or mutation (mutation hot spot).
Housekeeping gene: A gene whose product is essential for most cells, for example, anything needed for RNA transcription, DNA replication, or metabolic processes.
Human Genome Project: An international effort to map and sequence all human genes. The project began in 1990 and was declared completed in 2003, although work continues on certain aspects. The motivation behind the project was that sequencing and identifying all human genes would help us to better understand the genetic roots of disease and find ways to diagnose, treat, and perhaps prevent many diseases.
Huntington's disease: An inherited brain disorder in which a gradual loss of brain cells causes symptoms such as movement and walking problems, personality changes, and cognitive impairment. It is inherited in a dominant pattern, which means that an individual who has Huntington's disease has a 50% (1/2) chance of passing the faulty gene to each child.
In vitro: Biological processes and reactions occurring in either (1) cells or tissues grown in culture or (2) cell extracts or synthetic mixtures of cell components. Compare to In vivo.
In vitro fertilization (IVF): A reproductive technology in which sperm fertilize the egg in a laboratory dish and the fertilized eggs are put into a uterus for pregnancy to be established.
In vivo: A process occurring in a living organism. Compare to In vitro.
Incidence: The number of new cases of a condition that occur over a defined period of time. For example, the incidence of the birth defect "club foot" is about 1 in 1000 live births. Compare to Prevalence.
Inherited: Passed down from parents to offspring through generations. The genes present in the parents are passed down to their offspring through egg or sperm cells. Traits may be inherited in different patterns, such as X-linked, autosomal recessive, or autosomal dominant inheritance.
Innate immunity: The basic resistance to disease, or the first line of defense against infections. It is nonspecific and does not depend on previous exposure. Our skin is an important part of this defense: its physical barrier and low pH inhibit microbial growth. Body temperature is also involved, since it can inhibit the growth of invaders, and a fever will kill even more microorganisms.
Intracytoplasmic sperm injection (ICSI): A type of assisted reproductive technology (ART) that involves injecting a single sperm directly into the cytoplasm of the egg. This technique is often used to help overcome male infertility, in particular when the man has slow-moving, abnormally formed, or low levels of sperm.
Intron: An area of DNA that is transcribed into RNA but does not code for a protein. Not all parts of RNA leave the nucleus, and introns are the fragments that are cut out.
Junk DNA: An older (and now rarely used) term referring to noncoding DNA. These DNA segments were thought to be useless because their function was unknown; however, we are now beginning to understand that most of our "junk" DNA is actually an important and crucial part of our genetic makeup.
Karyotype: The characteristic set of chromosomes for a particular species. Chromosome number, shape, and size all determine the species' karyotype. For example, humans have 46 chromosomes: two copies each of chromosomes 1–22 and a pair of sex chromosomes.
Kinetochore: A region at the centromere of a chromosome to which microtubules attach during meiosis or mitosis. See Spindle.
Linkage: The tendency of two or more genes on the same chromosome to be inherited together. The closer two genes are on a chromosome, the more likely it is that they will be inherited together.
Lipid: A diverse class of compounds found in all living cells whose main biological functions include storing energy, cellular signaling, and acting as structural components of cellular membranes.
Lipoprotein: An assembly of both proteins and lipids that functions to transport water-insoluble fats throughout the vascular and lymphatic systems.
Lipoprotein lipase (LPL): An enzyme that plays a critical role in transporting fats and breaking down fat-carrying molecules called lipoproteins.
Macromolecule: General term for a protein, nucleic acid, polysaccharide, or other very large polymeric organic molecule. Adjective: macromolecular.
Meiosis: Cell division that results in cells that have half of the chromosomes of the original parent cell. Meiosis is the mechanism by which gametes are produced. Compare to Mitosis.
Mendel, Gregor: A monk who lived from 1822 to 1884 and developed the first understanding of the basics of inheritance in sexually reproducing organisms by conducting crossing experiments on pea plants. He provided a collection of experimental observations that were translated into generally applicable rules describing how some traits are transferred between generations.
Mendelian inheritance: A set of principles of inheritance derived from the work of Gregor Mendel. In short, these principles state that alleles separate into gametes such that each gamete contains only a single copy of a gene (segregation). Furthermore, the alleles of different genes separate into gametes independently and do not sort based on the inheritance of other genes (independent assortment).
Messenger RNA (mRNA): An RNA molecule that carries the message that acts as a template for translation into protein.
Metabolism: The complete set of chemical reactions that happen in an organism. It includes the conversion of food into energy, the production of components for new cells, and the maintenance of existing cells.
Metabolite: Any chemical compound that is involved in or is a product of metabolism.
Metabolomics: The study of the set of metabolites present within an organism, cell, or tissue.
Metastasis: The migration of cancer cells to colonize tissues and organs other than those in which they originated.
Microarray (DNA): A slide or membrane with small bits of DNA of known sequence fixed to it. It allows detection of genetic sequences by complementary binding of unknown DNA samples that are being tested.
Microbe: A microscopic organism; a microorganism. Includes bacteria, some fungi, protozoa, and viruses.
Microorganism: Any of various microscopic organisms, especially a bacterium or virus.
Microtubules: Fine hollow protein tubes involved in the intracellular transport of materials and movement of organelles. Microtubules also form the mitotic and meiotic spindles.
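The codon and mRNA entries above describe how three-base codons are read off a messenger RNA and mapped to amino acids. A small Python sketch of that lookup; note the codon table here is a hand-picked subset of the standard genetic code, not the full 64-entry table:

```python
# A few entries from the standard genetic code (mRNA codons). Several codons
# map to the same amino acid, e.g., GCU and GCC both encode alanine.
CODON_TABLE = {
    "AUG": "Met", "GCU": "Ala", "GCC": "Ala", "UUU": "Phe",
    "AAA": "Lys", "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna: str) -> list[str]:
    """Read an mRNA string three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "STOP":
            break
        protein.append(amino)
    return protein

print(translate("AUGGCUUUUUAA"))  # ['Met', 'Ala', 'Phe']
```

Real translation is performed by the ribosome with tRNA adapters, as the Protein synthesis and Ribosome entries below describe; this sketch only models the codon-to-amino-acid lookup.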
Mitochondrion: An organelle found in the cytoplasm of most eukaryotic cells and responsible for generating the energy for the cell.
Mitochondrial DNA (mtDNA): A circular DNA molecule found inside each eukaryotic mitochondrion. It codes for most of the proteins needed by the mitochondrion.
Mitosis: Cell division that produces two daughter cells with chromosomes, and therefore DNA, identical to the parent cell. Mitosis is the mechanism by which somatic cells are produced. Compare to Meiosis.
Molecular chaperones: Proteins that assist in the correct folding or transport of other proteins. This helps to make sure proteins and macromolecular structures are assembled properly and in the right place so they are useful to the cell.
Monosomy: The condition of missing one chromosome of a pair. In humans, this is usually lethal, except in Turner's syndrome, where women have one X chromosome instead of the usual two (the term is occasionally also used to refer to missing part of a chromosome).
Mosaic: An individual or tissue containing at least two cell lines that differ genetically but that have been derived from the same fertilized egg. Compare to Chimera.
Mouse model: A system that uses mice as the experimental organisms to answer a question about biology. Mice are used because their genomes have been well studied. There are currently many mouse models available for research that mimic human diseases like obesity, cancer, and neurological conditions like Huntington's disease.
Multifactorial: Refers to a biological or physiological observation that is attributable to many factors. For example, some traits are determined by the interplay between genes and environment. Most common traits, like skin color and height, are multifactorial. Although your genes give a rough estimate of your height, environment can also have an impact: if you do not have adequate nutrition during childhood, you will likely not reach your height potential.
Mutation: A change in the DNA sequence of an organism that can have no effect or be either beneficial or harmful.
Necrosis: Localized and premature death of cells in an organ or tissue due to disease or injury. Compare to Apoptosis.
Nucleic acid: A large organic molecule made up of a chain of nucleotides. Examples include DNA and RNA.
Nucleoside: An organic compound made up of a purine or pyrimidine (base) joined to a sugar. Nucleosides are structurally very similar to nucleotides but, unlike them, do not contain any phosphate groups.
Nucleosome: A basic unit of DNA packaging in eukaryotes, consisting of a segment of DNA wound around a core of eight histone proteins. This structure is often compared to thread wrapped around a spool.
Nucleotide: An organic compound made up of a purine or pyrimidine (base) joined to a sugar and a phosphate group. Nucleic acids (DNA and RNA) contain nucleotides linked together in long chains. Compare to Nucleoside.
Nucleus: A membrane-bound organelle containing the genetic material of a eukaryotic cell.
Nutrigenomics: The study of the interaction between diet, genes, and environment and how they affect human health.
Oligonucleotide: A short fragment of single-stranded DNA or RNA, typically 5–50 nucleotides long. Oligonucleotides can serve as primers to start PCR and as the fixed targets in DNA microarrays.
Oncogene: A gene that, when mutated, can promote growth beyond the cell's normal needs, thus leading to tumors. For example, a growth factor gene that is always expressed, even when there is no signal for growth, can cause a cell to grow beyond its normal limits. Compare to Tumor suppressor gene.
Open neural tube defect (ONTD): A birth defect characterized by incomplete development of the spinal cord or the brain. The most common ONTD is spina bifida, a condition where the vertebrae do not fully enclose the spinal cord.
Operon: A segment of DNA containing linked genes that function in a coordinated manner, usually under the control of a single promoter.
Organelle: A subcellular structure in eukaryotic cells with a specialized function. These membrane-bound compartments are sometimes compared to organs in the human body, where each system has a different job. Examples include mitochondria, the Golgi complex, the endoplasmic reticulum, lysosomes, peroxisomes, and the nucleus.
Organism: An individual animal, plant, or single-celled life form.
Orthologous: Equivalent genes in different species that are homologous because they have both evolved in a direct line from a common ancestral gene.
Paralogous: Two homologous genes in the same or different genomes that are similar because they derive from a gene duplication.
Parasite: An organism that lives in or on another organism (its host) and benefits by deriving nutrients at the host's expense.
Pathogen: An agent causing disease.
Pathology: The study of diseases and their causes, processes, development, and consequences.
Penetrance: The proportion of individuals carrying a particular genotype that express the associated phenotype. Important for genetic diseases because complete penetrance means that 100% of people with the disease mutation will have the associated phenotype.
Peptide bond: A covalent bond joining the α-amino group of one amino acid to the carboxyl group of another with the loss of a water molecule. It is the bond linking amino acids together in a protein chain.
Phagocytosis: The process of a cell engulfing and eating particles or other cells.
Pharmacogenomics: A branch of pharmacology concerned with using DNA and amino acid sequence data to inform drug development and testing. An important application of pharmacogenomics is correlating individual genetic variation with drug responses.
Pharmacology: The study of the action of drugs and other biologically active chemicals.
Phenotype: The set of observable characteristics of an organism that are the result of its genotype and the environment.
Polygenic: Polygenic traits are affected by more than one gene. Phenotype variation in some traits is due to the interaction of many genes, each with a small additive effect on the character in question. For example, it is currently believed that there are at least 11 genes involved in the determination of skin color.
Polymerase chain reaction (PCR): A technique for selectively and rapidly replicating a particular stretch of DNA in vitro to produce a large amount of that particular sequence.
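A Punnett square (defined under "P" below with the other entries in this glossary) is straightforward to generate programmatically. A minimal Python sketch for a single-gene cross; the allele strings are illustrative, with uppercase marking the dominant allele:

```python
from itertools import product

def punnett_square(parent1: str, parent2: str) -> list[str]:
    """All genotype combinations from two parents' allele pairs.
    Sorting each pair writes the dominant (uppercase) allele first."""
    return ["".join(sorted(pair)) for pair in product(parent1, parent2)]

# Aa x Aa cross: the classic 1 AA : 2 Aa : 1 aa Mendelian ratio.
print(punnett_square("Aa", "Aa"))  # ['AA', 'Aa', 'Aa', 'aa']
```

Counting the resulting genotypes recovers Mendel's expected ratios for any single-gene cross, which is exactly what the diagram form of a Punnett square illustrates.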
Polymorphism: Literally, the occurrence of different forms. This can mean different things depending on the field of study. In genetics: the presence of variation in DNA sequence (see, for example, Single nucleotide polymorphism). In general biology: the presence of individuals with different phenotypes in a colony, population, or species.
Polysaccharide: A diverse class of high molecular weight carbohydrates formed by the covalent linking together of monosaccharides into linear or branched chains.
Preimplantation genetic diagnosis (PGD): A technology that allows embryos created by in vitro fertilization to be tested for a genetic condition before transferring them to a uterus. PGD is an option for couples who are at risk of passing on a genetic condition. Once the embryo has grown to the 8–16 cell stage, one cell is removed and genetic testing is done. Only those embryos that are not affected by the genetic condition tested for are implanted.
Prevalence: The total number of existing cases of a disease or condition in a population at a specific point in time. For example, the World Health Organization estimates that over 177 million people have diabetes. Compare to Incidence.
Primer: A short nucleic acid chain that serves as a starting point for the copying of DNA (DNA replication). This short stretch of DNA or RNA is complementary to part of the DNA that is about to be copied and binds to it to allow attachment of the other machinery needed (e.g., DNA polymerase) to copy the DNA.
Prokaryote: An organism that consists of cells that do not have membrane-bound organelles. Most prokaryotes are single-celled organisms. Importantly, the DNA of prokaryotes is found loose in the cell rather than in a nucleus. Compare to Eukaryote.
Promoter: A region of DNA located at the beginning (5′ end) of a gene. It contains sequences important in starting transcription. Mutations in the promoter region may cause incorrect expression of the gene and can lead to disease.
Protein: Large organic molecule made up of various combinations of amino acids. Proteins support living organisms’ shape and structure; carry messages within cells and between them; and, as enzymes, regulate the chemical processes that sustain life. Protein synthesis: The process of building proteins by reading messages carried on mRNA and converting them to chains of amino acids using ribosomes and transfer RNA (tRNA). Proteomics: The science that studies which proteins of the genome are expressed and when. Initially aimed at cataloging the proteins present in a cell under various conditions, proteomics has joined up with genomics to try to understand how the expression of the genome enables all the complex functions of the cell to work. Punnet Square view: A Punnett Square is a diagram that illustrates the possible genotype combinations from a mating as determined by Mendelian inheritance rules. Recessive: A heritable characteristic controlled by genes that are expressed in offspring only when inherited from both parents (i.e., homozygous recessive). Recombination view: Reorganization, shuffling, or other moving of genetic material from one place on the chromosome to another or to a different chromosome. The most common example is crossing over. Restriction fragment length polymorphism (RFLP): A size difference (length polymorphism) among fragments after DNA has been cut with restriction enzymes. Each cut piece is a restriction fragment and each fragment size is affected by DNA sequence. For example, if a target sequence is present, the restriction enzyme will cut the DNA molecule; if a target sequence is missing, the DNA
Appendix 6.A Glossary of DNA encoding (from merit CyberSecurity library)
169
will not be cut and the fragment will consequently be longer. This characteristic has been important in creating lab tests for paternity and for DNA fingerprinting. Ribonucleic acid (RNA) view: A long nucleic acid molecule found in the nucleus and cytoplasm of a cell. Similar to DNA, RNA consists a sugar-phosphate backbone with nitrogenous bases. RNA differs from DNA by having a different sugar in its backbone (ribose instead of deoxyribose); having uracil as a base instead of thymine; and functioning as a single-stranded molecule instead of a doublestranded helix. One function of RNA is to convey genetic information, encoded by DNA, to the protein synthesis machinery. This process is known as translation and involves three types of RNA that work together to achieve this task: mRNA, rRNA, and tRNA. Ribosomal RNA (rRNA): The RNA component of the ribosome. It works together with mRNA, tRNA, and amino acids during translation. Ribosome: A complex of proteins and rRNA that converts information from mRNA into an amino acid chain. In other words, it translates the mRNA sequence into a protein. Risk: The probability of a negative event occurring. For example, women have a 12% risk of developing breast cancer over their lifetime. The perception of risk can vary between different people. With our example, some people might think 12% is a low risk, whereas others might think it is a high risk. RNA polymerase: The enzyme that transcribes DNA into RNA. Semi dominant: Having an intermediate phenotype in an individual that has a heterozygous genotype for a particular trait. For example, degree of hair curliness is semidominant; if “A” represents curly hair and “a” represents straight hair, an individual who has an “Aa” genotype would have wavy (i.e., not curly but not straight, rather intermediate) hair. Sequencing: Reading of the components of a molecular chain of building blocks. 
For example, determining the order of nucleotides in a DNA or RNA chain or the amino acids within a protein. Sex-linked traits view: The tendency of certain characteristics to appear in one sex. Traits encoded by genes on one of the sex chromosomes (X or Y chromosomes in humans) can be expressed differently in males and females because males have an X and a Y chromosome, whereas females have two X chromosomes. Side chain: The part in an amino acid not involved in forming the peptide bond. The side chain gives each amino acid its characteristic chemical and physical properties. Single nucleotide polymorphism (SNP): A variation in a single base (A, T, C, or G) within a sequence of DNA. For any single-base variation to be called an SNP the minor allele must be found in more than 1% of the population. So far more than 6 million SNPs have been discovered in the human genome. SNPs do not generally cause disease directly, but some SNPs may indicate an individual’s susceptibility to disease or the response to drugs and other treatments. Somatic cell nuclear transfer (SCNT): A laboratory method for creating clones of animals. SCNT requires two donors: a nucleus and an egg. The egg is stripped of its nucleus and the donor nucleus is placed inside. Because the donor nucleus is diploid, whereas normal egg cells are haploid, the egg is “tricked” into behaving as if it were fertilized. This leads the egg to divide and differentiate, resulting in an embryo that will develop into a copy (clone) of the organism that donated the nucleus. Somatic cells: The cells that form the body only. Unlike in germ cells, mutations or manipulations in these cells are not passed on to the next generation. Spina bifida: A condition where the bones of the spine (vertebrae) do not close properly around the spinal cord so part of the spinal cord is exposed and nerve damage occurs. The extent of damage
Chapter 6 Getting DNA storage on board
depends on the size and location of the gap in the vertebrae. Spina bifida can occur alone or as part of a syndrome with other multiple birth defects. The exact cause of spina bifida is unknown; however, both environmental and genetic factors are thought to play a role (multifactorial inheritance).
Spindle: The structure formed by microtubules stretching between opposite sides of the cell during mitosis or meiosis, which guides the movement of chromosomes.
Sporadic: Rare or unpredictable. In genetics, it specifically means not known to be inherited. For example, if one person in a family had a genetic condition that was not found in any other family members, they may be referred to as having a sporadic case of that condition.
Stem cell: A cell that has the ability to self-renew or divide indefinitely. Stem cells can also develop into other specialized cell types. There are several classifications: totipotent, pluripotent, and multipotent.
Telomere: A region of repetitive DNA at the end of a linear chromosome that protects the end from deterioration or destruction. In somatic cells, telomeres get shorter with each round of cell division; once they become too short to protect the ends of the chromosomes, cell division may be prevented, or cells may die.
Teratogen: A substance or exposure that causes birth defects. Many drugs, infections, chemicals, and forms of radiation can be teratogens.
Trait: A characteristic that an individual possesses. It could be a personality or behavior trait (like being warm and friendly), intelligence (a high IQ score), or a physical feature (like red hair or height). Genes may or may not have a determining role in a trait.
Transcript: The RNA that is synthesized by RNA polymerase on a DNA or RNA template.
Transcription: The enzymatic process whereby complementary RNA sequences are created from DNA templates.
Transcriptomics: The study of the transcriptome, the complete set of RNA transcripts produced by the genome at any given time.
Transfer RNA (tRNA): RNA responsible for bringing amino acids to the ribosome and working with the mRNA and rRNA to make an amino acid chain. tRNA transfers the amino acid to the growing protein chain during translation. Each tRNA molecule is specific to a certain amino acid.
Transgenic: Having a piece of DNA that originally stems from a different species. Transgenic organisms are a subset of genetically modified organisms (GMOs).
Translation: The process of producing proteins from the information stored on an mRNA molecule. In other words, it is the process of translating from the language of nucleotides to the language of proteins.
Translocation: The transfer of part of a chromosome onto another nonhomologous chromosome. Translocations may be balanced (having the right amount of chromosome material) or unbalanced (having too much or too little chromosome material). A reciprocal translocation is one in which two nonhomologous chromosomes exchange material.
Transposon: A segment of DNA that can remove itself from a chromosome and insert itself somewhere else in the genome, i.e., it can "jump" from one place in the genome to another. Transposons can also move between organisms. They play a role in the transfer of antibiotic resistance among bacteria and can cause disease by creating mutations.
Triplet repeat expansion: Triplet repeats are three nucleotides that are repeated a number of times (e.g., TACTACTACTACTAC). Triplet repeat expansion is when extra repeats are added (e.g., more TACs). If this large repeat is within a gene, it can cause disease (e.g., Huntington's disease).
Trisomy: The occurrence of an extra chromosome (i.e., a third copy of a particular chromosome) in the total chromosome count of an individual. People typically have two copies (disomy) of all of their chromosomes. Having trisomy of a full chromosome is often not compatible with life, except for an additional chromosome 21 (Down syndrome) or extra sex chromosomes.
Trisomy 18 and Trisomy 13: Typically lethal conditions caused by an extra chromosome in each cell (chromosomes 18 and 13, respectively) and involving many birth defects.
Trisomy 21: Also known as Down syndrome. Caused by an extra chromosome 21 in each cell.
Tumor: A growth resulting from the abnormal proliferation of cells. It may be self-limiting or noninvasive, in which case it is called a benign tumor. A malignant tumor, or cancer, proliferates indefinitely, invades underlying tissues, and metastasizes.
Tumor suppressor gene: A normal gene that controls how often and how fast a cell divides. If both copies of a tumor suppressor gene are inactivated, the growth of the cell may go out of control and become a tumor.
Ultrasound: A technology that uses sound waves to create a picture. An ultrasound technician (sonographer) uses a hand-held wand (transducer) to direct sound waves at the internal body part of interest. The reflection or return of these waves creates a black and white image. Fluid appears black, solid material like bone appears white, and other tissues, like most organs, appear as shades of gray.
Uterus: The organ in which the embryo develops and is nourished before birth in female mammals. In other mammals, the uterus is an enlarged portion of the oviduct modified to serve as a place for development of the young or of eggs.
Variable-number tandem repeat (VNTR): Catch-all term for repeats in the genome. Generally not associated with disease, these repeats are used in DNA profiling because the sequences vary in length between people.
VNTR sounds complicated, but the phrase simply means "repeats that vary in size."
Virus: An obligate intracellular parasite that is unable to multiply or express its genes outside a host cell, as it requires host cell enzymes to aid DNA replication, transcription, and translation. Viruses cause many diseases of humans, animals, plants, and bacteria.
X chromosome: One of the two sex chromosomes in humans. Women generally have two X chromosomes, whereas men have one X chromosome and one Y chromosome.
X chromosome inactivation: The inactivation of all but one copy of the X chromosome in female mammals.
Y chromosome: One of the two sex chromosomes in humans. Men generally have one X chromosome and one Y chromosome, whereas women have two X chromosomes. The Y chromosome contains the genes that trigger male development and proper sperm formation.
Zygote: A diploid cell that is the result of the fusion of a haploid egg and a haploid sperm. The term "zygote" only applies before cell division starts (after which it is called an embryo).
Suggested readings
Bitcoin: https://www.cnet.com/news/bitcoin-fanatics-are-storing-their-cryptocurrency-passwords-in-dna-carverr/.
Blockchain and Ethereum: https://hackernoon.com/ethereum-turing-completeness-and-rich-statefulness-explained-e650db7fc1fb.
Church, G., Regis, E., 2012. Regenesis. Basic Books Publishing.
Digital technology transforming Dubai: https://www.weforum.org/agenda/2017/05/how-digital-technology-is-transforming-dubai/.
DNA archival: https://homes.cs.washington.edu/~bornholt/dnastorage-asplos16/.
DNA-based archival system: https://homes.cs.washington.edu/~bornholt/dnastorage-asplos16/#fig:systemoperation and https://homes.cs.washington.edu/~bornholt/dnastorage-asplos16/#cite:10.
DNA computing: https://pdfs.semanticscholar.org/479e/81852d1f4c72ea9c221aeec6152c6abca6ab.pdf.
Goldman DNA charts: https://www.cs.utexas.edu/~bornholt/dnastorage-asplos16/.
DNA writer: https://earthsciweb.org/js/bio/dna-writer/.
Dubai innovation index: http://www.dubaichamber.com/uploads/pdf/DII2017Report_Feb26.pdf.
Enhancing Big Data security: http://www2.advantech.com/nc/newsletter/whitepaper/big_data/big_data.pdf.
Enzymic DNA storage: https://www.biorxiv.org/content/biorxiv/early/2018/06/16/348987.full.pdf.
Evolution of storage: https://www.zdnet.com/topic/the-evolution-of-enterprise-storage/.
Glossary of DNA Encoding (From Merit CyberSecurity Library): www.termanini.com.
Golden ratio: www.goldennumber.net/golden-ratio-divine-beauty-mathematics-meisner/.
HBR blockchain: https://hbr.org/2017/01/the-truth-about-blockchain.
How to deal with the data deluge: https://www.zdnet.com/article/from-digital-to-biological-why-the-future-of-storage-is-all-about-dna/.
Huffman encoding: www.cs.ucsb.edu/~franklin/20/assigns/prog6files/HuffmanEncoding.htm.
Idaho State Micron Dept: https://www.idahostatesman.com/news/business/article218442875.html.
IEEE: Huffman, D.A., September 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40 (9), 1098-1101.
A DNA-Based Archival Storage System: https://jeenalee.com/2016/11/21/pwl.html and https://www.cs.utexas.edu/~bornholt/dnastorage-asplos16/.
Ignatova, Z., Martinez-Perez, I., 2008. DNA Computing Models. Springer Publishing.
Law of accelerating returns: http://www.kurzweilai.net/the-law-of-accelerating-returns.
Mathematics of DNA: https://evo2.org/mathematics-of-dna/.
New trends in DNA storage: www.ncbi.nlm.nih.gov/pmc/articles/PMC5027317/.
Feynman, R.P., 1985. "Surely You're Joking, Mr. Feynman!". Norton & Company.
Smart city as Dubai: https://www.econstor.eu/bitstream/10419/146853/1/848989597.pdf.
Buffalo, V., 2015. Bioinformatics Data Skills. O'Reilly Media, Inc.
IEEE Spectrum paper: https://spectrum.ieee.org/semiconductors/devices/exabytes-in-a-test-tube-the-case-for-dna-data-storage.
What is an Oligo: https://www.biosyn.com/faq/what-is-an-oligo-or-oligonucleotide.aspx.
CHAPTER 7
Synthesizing DNA-encoded data
I believe things like DNA computing will eventually lead the way to a "molecular revolution," which ultimately will have a very dramatic effect on the world. – L. Adleman
We believe this is the highest-density data-storage device ever created. – Yaniv Erlich
As DNA molecules are very small, a desktop computer could potentially utilize more processors than all the electronic computers in the world combined. – Ross D. King
A mind needs books like a sword needs a whetstone, if it is to keep its edge. – Tyrion Lannister
One trillionth of a gram! When it comes to storing information, hard drives do not hold a candle to DNA. Our genetic code packs billions of gigabytes into a single gram. A mere milligram of the molecule could encode the complete text of every book in the Library of Congress and have plenty of room to spare. All of this has been mostly theoretical, until now. In a new study, researchers stored an entire genetics textbook in less than a picogram of DNA (one-trillionth of a gram), an advance that could revolutionize our ability to save data.
The DNA writer
A simple program developed by Lensyl Urbano shows how text can be converted into DNA code and back. The screen is shown in Fig. 7.1. We contacted Lensyl Urbano to discuss the program, a copy of which is available to the public. In the first step, the information in DNA is transferred to a messenger RNA (mRNA) molecule by way of a process called transcription. Translation is the process by which a protein is synthesized from the information contained in a molecule of mRNA. The genetic code is a set of three-letter combinations of nucleotides called codons, each of which corresponds to a specific amino acid or stop signal. Translation occurs in a structure called the ribosome, which is the assembly factory for the
FIGURE 7.1 It is a black-box program (input and output). You type a strand of DNA; it converts the DNA stream to binary and then to the final text. The translation engine in the DNA Writer code uses a simple look-up table in which each letter of the English alphabet is assigned a unique three-letter nucleotide code. The three letters are chosen from the letters of the DNA bases (AGCT), similar to the way codons are organized in mRNA. Any unknown characters or punctuation are ignored.
synthesis of proteins. The original DNA string, made of the letters ATCG, is grouped into sets of three nucleotides called codons. The order of the codons determines the sequence of amino acids in the protein. The process goes like this: the DNA code is read as codons, each codon selects an amino acid code from the translation table, and the amino acids are sent to the ribosome to make proteins. Fig. 7.2 illustrates the structure of the codons.
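The look-up-table idea behind DNA Writer can be sketched in a few lines of Python. The actual table used by the program is not reproduced in the text, so the mapping below (each letter plus the space character assigned one of the 64 possible three-base codons) is a hypothetical stand-in that follows the same scheme; unknown characters are ignored, as in the original program.

```python
from itertools import product

# Hypothetical look-up table in the spirit of DNA Writer: each character
# gets a unique three-letter "codon" drawn from the DNA bases A, G, C, T.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
CODONS = ["".join(c) for c in product("AGCT", repeat=3)]  # 64 possible codons
TO_DNA = dict(zip(ALPHABET, CODONS))
FROM_DNA = {codon: char for char, codon in TO_DNA.items()}

def encode(text: str) -> str:
    """Translate text to a DNA strand; unknown characters are ignored."""
    return "".join(TO_DNA[ch] for ch in text.upper() if ch in TO_DNA)

def decode(strand: str) -> str:
    """Read the strand back three bases (one codon) at a time."""
    return "".join(FROM_DNA[strand[i:i + 3]] for i in range(0, len(strand), 3))

strand = encode("HELLO DNA")
assert decode(strand) == "HELLO DNA"   # lossless round trip
```

Because every character maps to exactly three bases, a message of n encodable characters always becomes a strand of 3n nucleotides.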
DNA Fountain software strategy
The researchers then randomly compiled the strings into so-called droplets using DNA Fountain codes, which were developed by Erlich and Zielinski. Here is their story of success: Dr. Yaniv Erlich is a DNA scientist of world renown and serves as an assistant professor of Computer Science and Computational Biology at Columbia University and as a core member at
FIGURE 7.2 The table on the left is a simple look-up table in which each letter of the English alphabet is assigned a unique three-letter code from the letters of the DNA bases (AGCT). The figure on the right shows how a sequence of three nucleotides (base pairs) together forms a unit of genetic code. AUG is an initiation codon; UAA, UAG, and UGA are termination (stop) codons.
the New York Genome Center. With the assistance of his associate Dina Zielinski, he developed an efficient encoding strategy for DNA storage, called DNA Fountain coding, and used it to encode a full computer operating system. The binary droplets were translated into the four DNA nucleotide bases (DNA blocks): A, G, C, and T. The erasure-correcting fountain algorithm ensured that no letter combinations known to cause errors were used, and it assigned a barcode to each droplet to aid file retrieval and reassembly. The coding process produced 72,000 DNA strands, each 200 bases long. Researchers Erlich and Zielinski then sent the encoded DNA file to Twist Bioscience, a San Francisco company that turns digital (binary) DNA into biological DNA (AGCT). Two weeks later, the company sent the researchers a file containing their DNA strands. DNA Fountain storage leads the other schemes, as shown in Table 7.1. For consistency, the table describes only schemes that were empirically tested with pooled oligo synthesis and high-throughput sequencing data. In Table 7.1, the schemes are presented chronologically by publication date. Coding potential is the maximal information content
FIGURE 7.3 The cycle from binary to DNA and back. In phase 1, we convert plain text to 8-bit binary; in phase 2, we use the encoding A = 00, T = 11, G = 10, C = 01 to convert the binary into DNA code (synthesis). In phase 3, we sequence the synthetic DNA code and store the converted DNA code, and in phase 4, we decode the DNA code back into binary data. 2017 Copyright [MERIT CyberSecurity Group]; All rights are reserved.
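The four-phase cycle in the caption is easy to make concrete. The sketch below implements phases 1 and 2 (text to 8-bit binary, then the 2-bit mapping A = 00, C = 01, G = 10, T = 11 from Fig. 7.3) along with the reverse decoding of phase 4; the function names are mine, and a real pipeline would add indexing and error correction on top.

```python
# 2-bit encoding from Fig. 7.3: A = 00, C = 01, G = 10, T = 11.
ENC = {"00": "A", "01": "C", "10": "G", "11": "T"}
DEC = {base: bits for bits, base in ENC.items()}

def text_to_dna(text: str) -> str:
    """Phase 1 + 2: plain text -> 8-bit binary -> DNA letters."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    return "".join(ENC[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_text(strand: str) -> str:
    """Phase 4: DNA letters -> binary -> plain text."""
    bits = "".join(DEC[base] for base in strand)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")

assert text_to_dna("Hi") == "CAGACGGC"   # H = 01001000, i = 01101001
assert dna_to_text(text_to_dna("DNA storage")) == "DNA storage"
```

Each byte becomes exactly four nucleotides, which is where the theoretical coding potential of 2 bits per nucleotide in Table 7.1 comes from.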
Table 7.1 Comparison of DNA storage coding schemes; the schemes are presented chronologically on the basis of date of publication.

Scheme          | Input data [Mbyte] | Coding potential(1) [bits/nt] | Redundancy(2) | Robustness to dropouts(3) | Error correction/detection(4) | Full recovery(5) | Net information density(6) [bits/nt] | Realized capacity(7) | Physical density(8) [PB/g]
Church et al.   | 0.65 | 1    | 1    | No         | No  | No  | 0.83 | 45% | 1.28
Goldman et al.  | 0.75 | 1.58 | 4    | Repetition | Yes | No  | 0.33 | 18% | 2.25
Grass et al.    | 0.08 | 1.78 | 1    | RS         | Yes | Yes | 1.14 | 62% | 0.005
Bornholt et al. | 0.15 | 1.58 | 1.5  | Repetition | No  | No  | 0.88 | 48% | -
Blawat et al.   | 22   | 1.6  | 1.13 | RS         | Yes | Yes | 0.92 | 50% | -
This work       | 2.15 | 1.98 | 1.07 | Fountain   | Yes | Yes | 1.57 | 86% | 214
(1) Coding potential is the maximal information content of each nucleotide before indexing or error correcting. (2) Redundancy denotes the excess of synthesized oligos to provide robustness to dropouts. (3) The proposed strategy to recover oligo dropouts (RS, Reed-Solomon codes). (4) The presence of error-correcting/detection code to handle synthesis and sequencing errors. (5) Whether all information was recovered without any error. (6) The input information in bits divided by the number of overall DNA bases requested for sequencing (excluding adapter annealing sites). (7) The ratio between the net information density and the Shannon capacity of the channel. (8) Physical density is the actual ratio of the number of bytes encoded and the minimal weight of the DNA library used to retrieve the information. Comparison of DNA storage coding schemes and experimental results; Courtesy of https://www.academia.edu/31886464/DNA_Fountain_enables_a_robust_and_efficient_storage_architecture.
of each nucleotide before indexing or error correcting. Redundancy denotes the excess of synthesized oligos to provide robustness to dropouts. Error correction/detection denotes the presence of error-correcting or error-detecting code to handle synthesis and sequencing errors (RS, Reed-Solomon codes). Full recovery indicates whether all information was recovered without any error. Net information density is the input information in bits divided by the number of synthesized DNA nucleotides (excluding adapter annealing sites). Realized capacity is the ratio between the net information density and the Shannon capacity of the channel. Physical density is the actual ratio of the number of bytes encoded to the minimal weight of the DNA library used to retrieve the information. This information was not available for the studies by Bornholt and colleagues or Blawat and colleagues, as indicated by the dashes in the table.
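The density and capacity definitions above can be turned into a small calculator. The oligo count and payload length below are illustrative assumptions (the split between information-carrying bases and adapter/index bases is not specified here), and 1.98 bits/nt is used as the per-nucleotide capacity figure, so the resulting numbers are only indicative.

```python
def net_information_density(input_bytes: int, total_nucleotides: int) -> float:
    """Input information in bits divided by the DNA bases synthesized for it."""
    return input_bytes * 8 / total_nucleotides

def realized_capacity(net_density: float, capacity_bits_per_nt: float = 1.98) -> float:
    """Ratio of the achieved density to the assumed capacity per nucleotide."""
    return net_density / capacity_bits_per_nt

# Illustrative numbers: 2.15 MB stored on 72,000 oligos, assuming 152
# information-carrying nucleotides per oligo (the rest being adapters/index).
density = net_information_density(2_150_000, 72_000 * 152)
assert 1.57 < density < 1.58   # bits per nucleotide
```

Note that the realized-capacity percentage depends entirely on which capacity figure is used for the channel, which is why different reports quote different percentages for similar densities.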
Fountain software architecture
Fountain is a new open-source program, distributed with a license agreement, designed for the management of large sequencing projects. It uses Oracle/Java and the Oracle relational database (Version 18c, with Polymorphic Table Functions and Active Directory integration). Starting with a collection of clone sequencing objects, the program generates and stores information related to the different stages of DNA sequencing, using a Web browser interface for user input. The user enters the binary data and lets Fountain do the rest. The generated sequences are subsequently imported and annotated using the basic local alignment search tool (BLAST). (In bioinformatics, BLAST is an algorithm for comparing primary biological sequence information, such as the nucleotides of DNA sequences.) BLAST allows researchers to run a query with reference library database search criteria against the public databases. In addition, simple algorithms to cluster sequences and to determine putative polymorphic positions (to validate certain DNA criteria) are implemented. The combination of Java and an Oracle relational database gives Fountain the advantage of being scalable and easy to maintain. No additional skills are needed apart from a basic knowledge of Java, SQL, and the administration of a relational database, and even a novice programmer should be able to customize the software. This is perhaps best explained by the fact that the main author of Fountain is a biologist who only recently started programming. All parts of Fountain (the database, the Web server, and the Java classes) can be installed on the same computer. However, Fountain is well suited for installation within a network, as all connections to the database rely on the JDBC API and Web browsers. We preferred Oracle over MySQL as the RDBMS in the intranet due to the availability of foreign key constraints and transactional control.
In our hands, Oracle is not difficult to install and run on Linux, and the fee for an intranet license is moderate. The Sequence, Cluster, and Polymorphism work packages are presented on the Internet Web server using MySQL as the RDBMS. Fountain was developed to handle the following tasks for large-scale sequencing projects:
1. Generation and storage of data related to libraries, clones, templates, primers, and sequence reactions
2. Import and annotation of sequence data
3. Clustering of sequences
4. Definition of single nucleotide polymorphisms (SNPs) or sequencing errors
5. Storage and retrieval of user-defined variables
FIGURE 7.4 The overall configuration of the Fountain system. Fountain is a modular system with several functional modules. The Oracle relational tables can be accessed from the different modules, and each module can be modified without affecting the tables or the other modules. 2014 Copyright [MERIT CyberSecurity Group]; All rights are reserved.
Fountain is divided into different work packages based on these tasks. This division is reflected in the Java package and subpackage structure and in the table structure (schema) of the database. In this way, the code of one work package can easily be modified without affecting the functionality of the other work packages, as long as the foreign key restrictions of the database schema are respected. In addition, it is possible to simplify or exchange entire work packages after deletion or modification of the foreign key restrictions. Fig. 7.4 shows the simplified architecture of the system.
A reliable and efficient DNA storage architecture
DNA is an attractive medium for storing digital information. Here, we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using this approach, a full computer operating system, along with a movie and other files totaling 2.14 × 10^6 bytes, was stored in DNA oligonucleotides, and the information was perfectly retrieved from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10^15 retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports. Humanity is currently producing data at exponential rates, creating a demand for better storage devices. DNA is an excellent medium for data storage, owing to its demonstrated information
density of petabytes of data per gram, high durability, and evolutionarily optimized machinery to faithfully replicate this information. Recently, a series of proof-of-principle experiments has demonstrated the value of DNA as a storage medium. Fig. 7.5 shows the schema of the Oracle tables.
FIGURE 7.5 The schema structure of the Fountain tables in Oracle. We can see how table elements are chained together to achieve efficient relational linking. Instead of creating one long table that holds all the elements, the schema defines small tables (by subject) and links them together by pointers (foreign keys). MERIT CyberSecurity Engineering obtained the Fountain program and conducted several tests.
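The "small tables linked by pointers" idea in the caption is ordinary relational normalization with foreign keys. The sketch below reproduces it using SQLite as a portable stand-in for Oracle; the table and column names are invented for illustration and are not Fountain's actual schema.

```python
import sqlite3

# Three subject tables chained by foreign keys (illustrative names only).
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE library (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE clone (
        id INTEGER PRIMARY KEY,
        library_id INTEGER NOT NULL REFERENCES library(id),
        well TEXT
    );
    CREATE TABLE sequence_read (
        id INTEGER PRIMARY KEY,
        clone_id INTEGER NOT NULL REFERENCES clone(id),
        bases TEXT NOT NULL
    );
""")
con.execute("INSERT INTO library VALUES (1, 'demo-lib')")
con.execute("INSERT INTO clone VALUES (1, 1, 'A01')")
con.execute("INSERT INTO sequence_read VALUES (1, 1, 'ACGTACGT')")

# Follow the pointers back from a read to its library.
row = con.execute("""
    SELECT library.name, sequence_read.bases
    FROM sequence_read
    JOIN clone ON clone.id = sequence_read.clone_id
    JOIN library ON library.id = clone.library_id
""").fetchone()
assert row == ("demo-lib", "ACGTACGT")
```

Because each module only touches its own subject tables through these keys, a table can be extended without rewriting the queries of unrelated modules, which is exactly the maintainability argument made for Fountain.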
DNA computing is around the corner
DNA computing is a brand-new domain (not a universe yet) that is very interesting to people with innovation-creativity dynamic neurons in the right side of their cerebrum. People with a musical left side will perhaps prefer "The Sound of Music." As the master code of life, DNA can do a lot of things: track inheritance and gene therapy, wipe out an entire species, solve logic problems, execute 200 million instructions per second, and recognize your sloppy handwriting. In a brilliant study published in the July 2018 issue of Nature, a team from Caltech cleverly hacked the properties of DNA, essentially turning it into a molecular artificial neural network. When challenged with a classic machine learning task, recognizing handwritten numbers, the DNA computer could deftly recognize characters from one through nine. The mechanism relies on a particular type of DNA molecule, dubbed the "annihilator" (awesome name!), which selects the winning response from a soup of biochemical reactions. But it is not just sloppy handwriting. The study represents a quantum leap in the nascent field of DNA computing, allowing the system to recognize far more complex patterns using the same number of molecules. With more tweaking, the molecular neural network could form "memories" of past learning, allowing it to perform different tasks, for example, in medical diagnosis. The concept of DNA computing was first introduced in 1994. It deals with "biochips" made of DNA that can perform billions of calculations at once by multiplying themselves in number. A DNA computer grows as it computes. In a recent development, researchers from the University of Manchester have shown that the creation of this conceptual computer is possible in real life (http://www.computinghistory.org.uk/det/6013/The-Manchester-Baby-the-world-s-first-stored-program-computer-ran-its-first-program).
Intel is struggling to increase the speed of its CPUs due to the limitations of Moore's law. The other processor makers (Samsung, Taiwan Semiconductor Manufacturing (TSMC), and GlobalFoundries) are also working hard to beat the speed records. There is no denying that researchers and scientists need to look for silicon alternatives for faster computing. Silicon-based computers have a finite number of processors, and thus their capabilities are also finite.
The Adleman discovery
Here is a surprising revelation: it almost sounds too fantastic to be true, but a growing amount of research supports the idea that DNA, the basic building block of life, could also be the basis of a staggeringly powerful new generation of computers. The idea of DNA computing was invented in 1994 by the famous cryptographer Leonard Adleman (the "A" in RSA: Rivest, Shamir, Adleman). He used DNA to solve the "traveling salesman" problem, which aims at finding the shortest route among several cities while going through each city only once. Adleman showed that the billions of molecules in a drop of DNA had so much computational power that they could simply overpower silicon-based computers. Adleman assigned sequences of the genetic alphabet A, T, C, and G to each of seven cities. Each city got a different strand of DNA, 20 bases long; he then dropped them into a stew of millions more strands of DNA that naturally bonded with the "cities." That generated thousands of random paths, in much the same way that a computer can sift through random numbers to break a code. From this hodgepodge of connected DNA, Adleman eventually extracted a satisfactory solution: a strand that led directly from the first city to the last, without retracing any steps. DNA computing was born.
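Strictly speaking, Adleman's experiment solved a directed Hamiltonian path instance. What the DNA molecules did massively in parallel can be mimicked in miniature by enumeration; the five-city graph below is invented for illustration and is not Adleman's original seven-city instance.

```python
from itertools import permutations

# A toy directed graph standing in for the "cities"; edges are invented.
EDGES = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"),
         ("B", "D"), ("D", "E"), ("C", "E")}
CITIES = ["A", "B", "C", "D", "E"]

def hamiltonian_path(start, end):
    """Find a path from start to end that visits every city exactly once,
    the in-silico analog of filtering Adleman's random DNA paths."""
    for middle in permutations([c for c in CITIES if c not in (start, end)]):
        path = (start, *middle, end)
        if all((a, b) in EDGES for a, b in zip(path, path[1:])):
            return path
    return None

path = hamiltonian_path("A", "E")
assert path == ("A", "B", "C", "D", "E")
```

The brute-force loop above is exponential in the number of cities; Adleman's insight was that a test tube explores all of these candidate paths simultaneously, with molecular filtering steps playing the role of the `if` test.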
We can say that DNA computing, a new branch of biomolecular computing, is a network of DNA-based nanodevices with decision-making and information-processing capabilities. DNA computing exploits the properties of DNA for massively parallel computation and parallel search. Fig. 7.6 shows the emergence of DNA computing in bioinformatics.
FIGURE 7.6 An isometric blueprint of one of the most pioneering architectures, showing the emergence of DNA computing and its connections with bioengineering, medicine, deep-AI defense, and DNA storage, meshing like the cylinders of a car engine. Here, we are focusing on DNA molecular storage and DNA computing. AI, artificial intelligence.
Dr. Jian-Jun Shu discovery
Here is an interesting story about DNA-enabled computing developed by Dr. Jian-Jun Shu of Nanyang Technological University in Singapore. In the search for a more affordable way forward, scientists are exploring the use of DNA for its programmability, fast processing speeds, and tiny size (www.acs.org/content/acs/en/pressroom/presspacs/2015/acs-presspac-may-6-2015/the-next-step-in-dna-computing-gps-mapping.html). The traveling salesman problem has been a great challenge for mathematicians; finding the best route through every city in a single pass can sometimes take weeks. DNA came to the rescue. So far, researchers have been able to store and process information with the genetic material and perform basic computing tasks. Shu's team set out to take the next step. The researchers built a programmable DNA-based processor that performs two computing tasks at the same time. On a map of six locations and multiple possible paths, it calculated the shortest routes between two different starting points and two destinations. The researchers say that in addition to cost and time savings over other DNA-based computers, their system could help scientists understand how the brain's "internal GPS" works.
The BLAST algorithm software
The sequence, cluster, and polymorphism modules use queries to access the public databases with similar relations before storing the records in the master sequence table of the database. The BLAST algorithm is the preferred software due to the following advantages:
1. Searches against the latest version of the public databases can be done over the Internet; the output format is structured and can be easily parsed. However, with a rigorous algorithm, a similarity search of a large nucleic acid or protein database with a single query sequence of moderate length may require an hour on a modern workstation.
2. Accordingly, rapid heuristic algorithms such as FASTA (FAST-All) and BLAST have been developed that can perform these searches up to two orders of magnitude faster than the Smith-Waterman algorithm. Heuristic methods speed up the process of finding a satisfactory solution, but the solution is not necessarily optimal.
3. Information about homologous query sequences can be obtained from the server of the National Center for Biotechnology Information (NCBI), which is part of the United States National Library of Medicine, http://www.ncbi.nlm.nih.gov:80/entrez.
For these reasons, the Fountain software relies heavily on the BLAST and FASTA analyzers.
BLAST software architecture
BLAST is a search engine with a series of mining programs that takes a DNA or protein query sequence, parses it, and searches its DNA or protein sequence databases for homology. A BLAST search enables a researcher to compare a sequence against a central library of sequences and identify the matches that score above a certain threshold. BLAST uses several heuristic algorithms to generate alignments with high scores. Alignments are evaluated by scores, and a probabilistic approach estimates how often a given score would arise by chance; the tail of the score distribution is exponential. The expected number of ungapped alignments with random sequences and score at least S is:
E = K m n e^(-λS)
where:
• S is the alignment score, derived from the scoring matrix,
• λ (lambda) is a normalizing constant for the scoring system,
• K is a search-space-related constant,
• m is the size (length) of the query,
• e is Euler's number, approximately 2.71,
• n is the size of the database (namely, the sum of the lengths of all the sequences in the database),
• E is the expected number of chance alignments found during a database search; it depends on the search space (m × n), the normalized score (λS), and the constant K.
One word on the FASTA software system
Bioinformatics (see Appendix 7.A, Bioinformatics) is an advanced field that includes many activities, from bioresearch and development to the application of computational tools and approaches for expanding the use of biological data, including tools to acquire, store, organize, and analyze such data. Bioinformatics now also includes DNA storage of binary data, which is a branch of DNA computing. FASTA is a multifunction package that performs a slew of biochemical activities, with sophisticated algorithms developed by Fredi and Barton and enhanced by Lipman and Pearson in 1985. FASTA is DNA and protein sequence alignment software and a fast homology search tool; it is faster than BLAST during sequence comparison. It also executes queries to compare the alignment of DNA sequences against a master DNA library. More information can be found at https://en.wikipedia.org/wiki/FASTA.
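For a flavor of what alignment scoring means, the toy function below slides a query along a target and keeps the best ungapped match score. This is a deliberate simplification and not the actual FASTA algorithm, which seeds alignments from short k-mer matches rather than trying every offset.

```python
def best_ungapped_score(query: str, target: str,
                        match: int = 1, mismatch: int = -1) -> int:
    """Best ungapped alignment score of query at any offset within target
    (a toy stand-in for the heuristics used by FASTA and BLAST)."""
    best = 0
    for offset in range(len(target) - len(query) + 1):
        score = sum(match if q == t else mismatch
                    for q, t in zip(query, target[offset:]))
        best = max(best, score)
    return best

assert best_ungapped_score("ACGT", "TTACGTTT") == 4   # exact hit
assert best_ungapped_score("ACGA", "TTACGTTT") == 2   # 3 matches, 1 mismatch
```

Real tools replace this quadratic scan with indexed k-mer lookups, which is precisely where their one-to-two-orders-of-magnitude speedup over exhaustive dynamic programming comes from.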
The story of binary code
In 1854, British mathematician George Boole published his famous paper detailing an algebraic system of logic that would become known as Boolean algebra. His logical calculus became instrumental in the design of digital electronic circuitry. In 1938, Claude Shannon published a paper showing that two-valued Boolean algebra can describe the operation of switching circuits: 0 and 1 were represented by open and closed switches. The concept of switching circuits was translated into the binary status of bits: on is 1 and off is 0, as shown in Fig. 7.7. The early computer "engine" then incorporated an arithmetic logic unit with control flow in the form of conditional branching and loops, plus integrated memory, making it the first design for a general-purpose computer that could be described in modern terms as Turing complete. Computer chip manufacturers are ardently competing to engineer the next microprocessor that will break the speed records; eventually, this competition is going to hit a wall. There have been many inventions throughout our time. Electricity, grâce à Thomas Edison and Nikola Tesla, changed the course of humanity: it started with one incandescent lamp, and today electricity is one of the most
Chapter 7 Synthesizing DNA-encoded data
FIGURE 7.7 This is the Rosetta Stone of computer engineering. It translates electrical pulses into the universal alphabet that we all understand. Without this table, we could not read or write anything on any computing device.
important pillars of civilization, consumed hyperexponentially like a runaway train. One of the great contributions to humanity was the discovery of the vaccine. Without adaptive immunity, one-fourth of the human race would have been terminally ill; Pasteur gave humanity its best defense against invading pathogens. Among the greatest contributions of our nation to the rest of the world is the Internet. Today, it is the crochet of our societal fabric; without the Internet, Amazon would sink like the Titanic in less than 6 hours. The annals of innovation are full of inventions that started with one "eureka" and a visionary mind. Now we have a new disruptive technology called "DNA computing," with which Dr. Leonard Adleman demonstrated that the billions of molecules in a drop of DNA contain raw computational power that might, just might, overwhelm silicon. DNA data storage is another earth-shattering technology, pioneered by Harvard University geneticist George Church and colleagues. They encoded a 52,000-word book in thousands of snippets of DNA, using strands of DNA's four-letter alphabet of A, G, T, and C to encode the 0s and 1s of the digitized file.
Nondeterministic universal Turing machine
Let us first define what a Turing machine is. It is a rudimentary character-comparison machine with no memory other than a paper tape that holds the input and output data. A deterministic machine always produces the same output for the same input; a nondeterministic machine may follow a different computational path every time we run the same input. A universal Turing machine is a general-purpose programmable computer, meaning it can solve any computable problem you like. So instead of having one computer for email, one for playing music, and one for writing documents, you can have one machine that does all of these: you just need to install the right software. Professor Ross D. King and his team from Manchester's School of Computer Science have demonstrated for the first time the feasibility of engineering a nondeterministic universal Turing machine (NUTM); their research was published in the prestigious Journal of the Royal Society Interface. The theoretical properties of such a computing machine, including its exponential boost in speed over electronic and quantum computers, have been well understood for many years; however, the Manchester breakthrough demonstrates that it is actually possible to physically create an NUTM using DNA molecules. Quantum computers and their quantum bits can also generate simultaneous and divergent paths, but they require specific symmetries to function properly, limiting their application and adaptability. The design of a Turing machine using DNA exploits DNA's ability to replicate in order to execute an exponential number of computational paths in polynomial (P) time. Each Thue rewriting step is embodied in a DNA edit implemented using a novel combination of polymerase chain reaction (PCR) and site-directed mutagenesis.
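To make the branching idea concrete, here is a toy Python sketch (not the Manchester design itself) of nondeterministic semi-Thue string rewriting. Every applicable rule at every position forks a new computation path, the exponential branching that the DNA-based NUTM realizes physically through replication. The rules below are invented for illustration.

```python
def rewrite_paths(word, rules, steps):
    """Enumerate all strings reachable from `word` by applying
    semi-Thue rules (lhs -> rhs) nondeterministically for up to
    `steps` rewriting steps. Each applicable rule at each match
    position forks a new computation path."""
    frontier = {word}
    for _ in range(steps):
        nxt = set()
        for w in frontier:
            for lhs, rhs in rules:
                start = w.find(lhs)
                while start != -1:
                    nxt.add(w[:start] + rhs + w[start + len(lhs):])
                    start = w.find(lhs, start + 1)
        if not nxt:
            break
        frontier |= nxt
    return frontier

# Toy rules: "ab" may become "ba" or be erased entirely.
reachable = rewrite_paths("aab", [("ab", "ba"), ("ab", "")], steps=3)
assert "aba" in reachable   # aab -> aba (swap)
assert "a" in reachable     # aab -> a   (erase "ab")
```

An electronic machine must explore these paths one at a time (or with bounded parallelism); a vial of replicating DNA strands can, in principle, pursue all of them at once.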
A couple of definitions are helpful because we will be using these terms quite often throughout this book: PCR: a method widely used in molecular biology to make multiple copies of a specific DNA segment. Using PCR, a single copy (or more) of a DNA sequence is exponentially amplified to generate thousands to millions of copies of a particular DNA segment. DNA synthesis: the natural or artificial creation of deoxyribonucleic acid (DNA) molecules. DNA synthesis is DNA replication: physically creating artificial gene sequences.
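The exponential amplification that PCR provides can be sketched in a couple of lines. The efficiency parameter is an illustrative assumption: ideal PCR doubles the template each cycle, while real reactions fall somewhat short of perfect doubling.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Number of DNA copies after PCR amplification.
    efficiency=1.0 models ideal doubling each cycle; real
    reactions have efficiency slightly below 1.0."""
    return initial_copies * (1 + efficiency) ** cycles

assert pcr_copies(1, 10) == 1024       # one molecule, 10 ideal cycles
assert pcr_copies(1, 30) == 2 ** 30    # roughly a billion copies
```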
Video and audio media features
Before DNA code is built, text data must be converted into binary code and then synthesized into DNA coding format. In this section, we would like to make the reader an expert in multimedia structures and formats. Compact disc (CD): The CD was launched in October 1982. The optical disc format has been very successful in this time and provides a compact and reliable distribution format, not just for music but for other applications as well. The CD is forecast to remain the mainstream format for music for some years to come. The CD will coexist with Internet music exchange and with the DVD (digital versatile disc), a digital optical disc storage format; but primarily, the threat will come from DNA's penetration into data storage. Despite the introduction of the DVD, the CD is likely to remain the
Table 7.2 The different recording formats of CD and DVD and their recording capacities, which are used for different recording purposes.

Format   Capacity   Description
CD       0.7 GB     Single layer, single side; read from one side only
DVD-5    4.7 GB     Single layer, single side; read from one side
DVD-10   9.4 GB     Single layer, double side; read from both sides
DVD-9    8.5 GB     Dual layer, single side; read from one side
dominant format for music for another decade. The CD supports a range of prerecorded formats for music, computer data, video games, and other applications, shown in Table 7.2. The following are the formats of the CD:
• CD Audio is the original CD format on which all other formats are based. CD Audio discs may also use CD-Graphics or CD-Text, whereas CD-Extra adds computer data to the audio.
• CD-ROM XA is a multimedia version of CD-ROM, used as the basis for CD-i, Video CD, and Photo CD. CD Bridge allows the last two formats to play on CD-i players.
• CD-ROM is derived from CD Audio to store computer data for PC games and other applications.
DVD disc: DVD-Video has become the format of choice for high-quality movies, TV series, and music videos. The DVD-Video specification provides the following features:
• 133 min of high-quality MPEG-2-encoded video with multichannel surround-sound audio
• The choice of widescreen, letterbox, and pan-and-scan video formats
• Menus and program chains for user interactivity
• Up to nine camera angles to give the user more choice
• Digital and analog copy protection
• Parental control for protection of children
DVD discs are forecast to last until around 2030, when they will be replaced by DNA storage.
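The 133-minute figure can be sanity-checked from disc capacity and an assumed average bitrate; the bitrate below is an illustrative assumption, not a number taken from the DVD specification.

```python
def playing_minutes(capacity_gb, avg_mbps):
    """Approximate playing time of a disc: capacity (decimal GB)
    divided by the average combined video+audio bitrate (Mbit/s).
    The bitrate value used below is an assumption for illustration."""
    bits = capacity_gb * 1e9 * 8
    return bits / (avg_mbps * 1e6) / 60

# A 4.7 GB DVD-5 at ~4.7 Mbit/s average yields roughly 133 minutes.
assert 125 < playing_minutes(4.7, 4.7) < 140
```

Higher bitrates trade playing time for picture quality, which is why a DVD can hold "up to several hours" at lower quality settings.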
Digital video
Digital video for CD and DVD has distinct characteristics: video and film must be converted to a digital format for storing on CD or DVD. The following formats use digital video:
• CD-ROM applications use a range of digital video options, including MPEG-1, depending on the requirements.
• Video CD and CD-i use MPEG-1 coding for full-screen video on a CD with playing times of over 70 min.
• Super Video CD (SVCD) uses MPEG-2 for higher-quality video but with playing times of only 40 min.
• DVD-Video uses MPEG-2 for the highest-quality video and playing times of up to several hours.
MPEG-encoded video normally conforms to PAL/SECAM (625-line) or NTSC (525-line) video, although the number of lines may be reduced, particularly for MPEG-1 encoding.
• The National Television System Committee (NTSC) of the Electronic Industries Association (EIA) prepared the TV standard for the United States, Canada, Japan, Central America, half of the Caribbean, and half of South America. When referring to NTSC video, what is normally meant is 525-line 60 Hz video. The number of active lines is 480.
• Phase alternating line (PAL) is the TV format used in most of Western Europe, Australia, and other countries. When referring to PAL video, what is normally meant is 625-line 50 Hz video, since PAL only refers to the way color signals are coded for broadcast purposes. The number of active lines is 576.
• Séquentiel couleur à mémoire, or sequential color with memory (SECAM), is the TV format used in France, Eastern Europe, and other countries. Like PAL, SECAM video means 625-line 50 Hz. The number of active lines is 576.
Binary to DNA code, revisited
Anatomy of text and binary
To obtain synthetic DNA, the input data (text and numbers) must be in binary format. Computers operate in binary, meaning they store data and perform calculations using only 0s and 1s; a single binary digit can represent only True (1) or False (0) in Boolean logic. DNA synthesizers accept only binary, which gets converted to DNA code (the encoding process). Table 7.3 illustrates the overwhelming proliferation of different recording formats. Fig. 7.8 shows the four phases of converting a user-created file in any of a variety of formats (text, video, audio, music, digital) into a DNA sequence (phase 3) and back to binary (phase 4). It is a closed-loop cycle with highly reliable results.
Table 7.3 The overwhelming variety of recording formats.

Common extensions that are binary file formats:
• Images: jpg, png, gif, bmp, tiff, psd, ...
• Videos: mp4, mkv, avi, mov, mpg, vob, ...
• Audio: mp3, aac, wav, flac, ogg, mka, wma, ...
• Documents: pdf, doc, xls, ppt, docx, odt, ...
• Archives: zip, rar, 7z, tar, iso, ...
• Databases: mdb, accde, frm, sqlite, ...
• Executables: exe, dll, so, class, ...

Common extensions that are text file formats:
• Web standards: html, xml, css, svg, json, ...
• Source code: c, cpp, cs, js, py, java, rb, pl, php, sh, ...
• Documents: txt, tex, markdown, asciidoc, rtf, ps, ...
• Configuration: ini, cfg, rc, rgc, ...
• Tabular data: csv, tsv, ...
2017 Copyright [MERIT CyberSecurity Group]; All rights are reserved.
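The practical difference between the two columns of Table 7.3 shows up in how a file is opened. A short Python sketch, using a throwaway temporary file, illustrates binary mode (raw bytes) versus text mode (decoded characters):

```python
import os
import tempfile

# Write a small file, then read it back both ways.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="ascii") as f:
    f.write("Hi")

with open(path, "rb") as f:                    # binary mode: raw bytes
    raw = f.read()
with open(path, "r", encoding="ascii") as f:   # text mode: characters
    text = f.read()

assert raw == b"Hi"
assert list(raw) == [72, 105]   # each byte is an integer 0..255
assert text == "Hi"
```

Opening a genuinely binary file (a jpg, a zip) in text mode risks decoding errors or silent corruption, which is exactly why the distinction matters for DNA encoding pipelines that must reproduce every byte faithfully.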
FIGURE 7.8 Files can be broadly classified as either binary or text. These categories have different characteristics and need different tools to work with such files. Starting with computer memory, every file is translated: a file is a finite-length sequence of (8-bit) bytes, where each byte is an integer between 0 and 255 (2^8 = 256 values, binary 00000000 to 11111111). The main difference between binary and text is that binary files can be any sequence of bytes and must be opened in an appropriate program that knows the specific file format. DNA sequencing companies such as Twist Bioscience have their own converters from binary into As, Gs, Cs, and Ts, the four nucleotides that make up DNA. Text files can be edited in any text editor program. Every text file is indeed a binary file, but this interpretation gives us no useful additional operations to work with. The reverse is not true, and treating a binary file as a text file can lead to data corruption. 2018 Copyright [MERIT CyberSecurity Group]; All rights are reserved.
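As a sketch of the encode/decode round trip, here is the simplest possible binary-to-DNA mapping: two bits per nucleotide. Commercial encoders such as Twist's add error correction and avoid long runs of the same base; this toy mapping omits all of that.

```python
BIT_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT = {b: s for s, b in BIT_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    """Encode bytes as DNA, two bits per nucleotide (encoding phase)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand back to the original bytes (decoding phase)."""
    bits = "".join(BASE_TO_BIT[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

message = b"DNA"
strand = bytes_to_dna(message)
assert dna_to_bytes(strand) == message   # lossless round trip
assert len(strand) == 4 * len(message)   # 4 nucleotides per byte
```

Because every byte becomes exactly four bases, any binary file, image, video, or executable, can in principle ride through this loop unchanged.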
Quantum computing As Moore’s law advances, the number of complex problems diminishes and computers become more powerful, and we can do more with them. The trouble is that transistors are just about as small as we can make them: We are getting to the point where the laws of physics seem likely to put a stop to any further expansion. Quantum computing means storing and processing information using individual atoms, ions, electrons, or photons. It is the branch of physics that deals with the world of atoms and the smaller (subatomic) particles inside them. Fig. 7.9 shows the classical configuration of personal computers of today. On the plus side, this opens up the possibility of faster computers, but the drawback is the greater complexity of designing computers that can operate in the weird world of quantum physics. Unfortunately, there are still hugely difficult computing problems that we cannot tackle because even the most powerful computers find them intractable. That is one of the reasons why people are now getting interested in quantum computing. A gentler way to think of qubits (quantum bits) is through the physics concept of superposition (where two waves add to make a third one that contains both originals). If you blow on a flute, the pipe fills up with a wave made up of a fundamental frequencydthe basic note you are playing and lots of harmonics (higher-frequency multiples of the fundamental). The wave inside the pipe contains all
FIGURE 7.9 The conventional configuration of today's binary computer. Every 2 years we change computers, buying a faster and smaller one according to Moore's law; Intel and the rest of the chip manufacturers are facing an ever higher hurdle to clear. The Internet and cloud computing are the largest consumers of storage and are starting to dominate the digital world. 2014 Copyright [MERIT CyberSecurity Group]; All rights are reserved.
these waves simultaneously: they are added together to make a combined wave that includes them all. Qubits use superposition to represent multiple states (multiple numeric values) simultaneously in a similar way. Quantum computers are not intended to replace classical computers; they are expected to be a different tool that we will use to solve complex problems that are beyond the capabilities of a classical computer. Quantum computing often grabs the headlines. The word "quantum" itself is intriguing enough, and when combined with the promise of computational power that surpasses anything we have seen so far, it becomes irresistible. But what exactly is quantum computing? Conventional computers store information using bits represented by 0s or 1s, whereas quantum computers use quantum bits, or qubits, to encode information as 0s, 1s, or both simultaneously. Quantum superposition (in which several states are held simultaneously) also comes with additional unique properties:
1. Quantum entanglement is a physical phenomenon in which several particles are grouped together so that the quantum state of each particle cannot be described independently; instead, the quantum state is described for the group as a whole.
2. Quantum tunneling is a phenomenon in which particles move through a barrier that should be physically impassable, such as an insulator, a vacuum, or a region of high potential energy. If a particle has insufficient energy to overcome a potential barrier, classically it simply will not pass. In the quantum world, however, particles can behave like waves: meeting a barrier, a quantum wave does not end abruptly; its amplitude decreases exponentially, and although most particles are turned back, some will succeed in passing through the barrier.
3. Quantum superposition is a fundamental phenomenon of quantum mechanics whereby two or more quantum states can be added together ("superposed") and the result is another valid quantum state. The corollary is that every quantum state can be represented as a sum of two or more other distinct states.
4. Quantum annealing is a type of computing operation whose objective is to solve large-scale combinatorial optimization problems.
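A minimal numeric sketch of superposition: a qubit can be written as a two-entry vector of amplitudes for the basis states |0> and |1> (taken as real numbers here for simplicity; in general they are complex), and the Hadamard gate puts a basis state into an equal superposition.

```python
import math

def hadamard(state):
    """Apply the Hadamard gate to a qubit state (a, b), where a and b
    are the amplitudes of |0> and |1>. It maps |0> to an equal
    superposition of |0> and |1>."""
    a, b = state
    s = 1 / math.sqrt(2)
    return (s * (a + b), s * (a - b))

def probabilities(state):
    """Born rule: measurement probabilities are the squared amplitudes."""
    a, b = state
    return (a * a, b * b)

qubit = hadamard((1.0, 0.0))        # start in |0>, superpose
p0, p1 = probabilities(qubit)
assert abs(p0 - 0.5) < 1e-9 and abs(p1 - 0.5) < 1e-9
```

Measuring this qubit yields 0 or 1 with equal probability, and measurement collapses the superposition; n such qubits together hold amplitudes for 2^n basis states at once.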
The QC entanglement process
Here is an example to illustrate the concept of entanglement: A cat in a closed box with something (say, an explosive) that may or may not kill it is in superposition, which means the cat can be either alive or dead. Until we open the box and observe the cat, we assign half the probability to its being dead and half to its being alive. There is one more important implication of this experiment: what happens within the closed box. If the explosive detonates, the cat will be dead; if it does not, the cat will be alive. The state of the cat is thus connected to that of the explosive; in the quantum world, this is described as quantum entanglement. While we are only at the beginning of this journey, quantum computing (QC) has the potential to help solve some of the most complex technical, scientific, national defense, and commercial problems that organizations face. We expect that quantum computing will lead to breakthroughs in science, engineering, modeling and simulation, healthcare, financial analysis, optimization, logistics, and national defense applications.
The D-Wave systems
Founded in 1999, D-Wave Systems, a Canadian company, is the world's first quantum computing company and a leader in the development and delivery of quantum computing systems and software. D-Wave's mission is to harness the power of quantum computing to solve the world's most challenging problems. D-Wave systems are being used by world-class organizations such as Lockheed Martin, Google, NASA, USC, USRA, Los Alamos National Laboratory, Oak Ridge National Laboratory, Volkswagen, and many others. By 2011, D-Wave had announced that it had produced a 128-qubit machine, as shown in Fig. 7.10. Three years later, Google announced that it had hired Dr. John Martinis, a physicist from the University of California at Santa Barbara, to develop its own quantum computers. In March 2015, the Google team announced they were "a step closer to quantum computation," having developed a new way for qubits to detect and protect against errors. In 2016, MIT's Isaac Chuang and scientists from the University of Innsbruck in Austria unveiled a five-qubit quantum computer that could perform prime factorization, with implications for encryption breaking (see Appendix 7.D, The Bachand DNA story). Quantum computing is capturing the interest of the scientific community and will eventually deliver a computing revolution. In December 2017, Microsoft unveiled a complete quantum development kit, including Q#, a computer language developed for quantum applications. In early 2018,
FIGURE 7.10 In 2011, D-Wave displayed its first integrated quantum computing system, running on a 128-qubit processor. In May 2013, at the Computing Frontiers 2013 conference, D-Wave published the first comparison of the technology against regular top-end desktop computers running an optimization algorithm. Using a quantum system with 439 qubits, the system performed 3600 times faster than a conventional machine, solving problems with 100 or more variables in half a second compared with half an hour.
D-Wave Systems planned to offer quantum power through a cloud computing platform, and Google announced the quantum computer Bristlecone, which is based on a 72-qubit array. Google hopes one day to use it to tackle real-world problems, though it will likely take several decades before that becomes a reality. One final note on quantum computing: Amazon CEO Jeff Bezos partnered with the CIA in an £18 million investment in the quantum computing firm D-Wave Systems. The primary investors in the financing round were In-Q-Tel, which invests in technology on behalf of the CIA, and Bezos Expeditions, Jeff Bezos' private investment firm.
Quantum computer versus DNA computer
We can say that comparing a quantum computer, a DNA computer, and a conventional computer is like comparing a Corvette, a Toyota SUV, and a Peterbilt 18-wheeler truck: each type has unique characteristics, with minimal similarities, and serves a different purpose. Imagine a DNA computer as a coin: a coin can be either heads up or heads down. A quantum computer is like a ball: the state of the ball is whichever part of it is touching the flat surface. There are an infinite number of states of the ball, as opposed to only two states of the coin. To change the state of the ball to another state takes a very precise rolling operation, applied for just the right amount of time to rotate the ball to the intended position; to change the coin, you just flip it. In quantum computing, the ball is much more difficult to work with. It is also much more error prone, because the slightest wind will move the ball to another state, while the coin would be affected only by the most immense gale. In much the same way, quantum computers are much more difficult to work with than classical computers (including DNA computers). So, we predict a useful DNA computer will come out before a useful quantum computer.
Here is an example comparing quantum and conventional computers. Let us say we had a bucket of very special dirt and wanted to move it 2000 miles across the United States. The fastest way we can think of would be to use a military jet fighter and fly it at supersonic speed, say 1000 miles per hour. But suppose the job was to move not a bucket of dirt but 100 tons of dirt. For this job, a train of railway cars will beat the jet fighter: the jet would have to make many thousands of round trips, which could take months, but the train does the entire job in one load. On the other hand, if we just want to send an email, the quantum computer is hopelessly slow and might never finish. DNA computers are like the supersonic jet in solving graph problems such as the Hamiltonian path: Adleman's approach can solve a 1000-node graph in 1 minute, whereas a conventional computer would take 20 years. For encryption-breaking problems, nothing beats the DNA computer: a 4096-bit cipher that could take 100 years on the fastest conventional computer might take 10 minutes on a DNA computer. There is a fundamental difference between quantum computers and conventional computers in how they compute: a quantum computer processes information in a totally different way. Quantum computers are to classical computers as calculus is to arithmetic: calculus includes everything in arithmetic, but it introduces many things that are not part of arithmetic so that it can solve different problems. Quantum computers are not doing the same thing as classical computers; they are doing different things. They are fundamentally different machines that work on different mathematical principles. DNA computers, on the other hand, simply do the same thing as a classical computer, just potentially faster. Quantum computers, as they do different things, will not be better at most types of problems.
A quantum computer will not help you play games faster or boot faster or anything like that. It will only help you if you are interested in factorization, unstructured search, and quantum simulation. And let us not forget that DNA is a formidable medium on which to store binary data for 10,000 years, as well as an incredible self-replicating processor that can solve certain complex problems in a matter of hours compared with days and months on a classical computer.
Coding malware into a strand of DNA
The buffer overflow has long been a feature of the computer security landscape. In fact, the first self-propagating Internet worm, 1988's Morris Worm, used a buffer overflow in the Unix finger daemon to spread from machine to machine. In Unix, finger is a program you can use to find information about computer users. It usually lists the login name, the full name, and possibly other details about the user you are fingering. These details may include the office location and phone number (if known), login time, idle time, time the mail was last read, and the user's plan and project files. Twenty-seven years later, buffer overflows remain a source of problems. Windows infamously revamped its security focus after two buffer overflow-driven exploits in the early 2000s. Supplying such detailed information as email addresses and full names was considered acceptable and convenient in the early days of networking but was later considered questionable for privacy and security reasons. The finger program has been used by hackers to initiate social engineering attacks on a company's computer security: using the finger command, a user can get a list of a company's employee names, email addresses, phone numbers, and so on, and a hacker can then call or email someone at the company requesting information while posing as another employee. The finger daemon (a small program that executes a specific function) has also had several exploitable security holes that crackers have used to break into systems. For example, in 1988, the Morris worm exploited an overflow vulnerability in thousands of business systems.
Cybersecurity is an arms race in which attackers and defenders play a constantly evolving cat-and-mouse game. Every new era of computing has served attackers with new capabilities and more creative malware vectors for executing their nefarious actions. Now hacking has reached DNA: hackers could slip a buffer overflow payload into a sequencing or synthesis program and modify DNA code.
Mechanics of the buffer overflow
Method 1: spilled data
At its core, the buffer overflow is an astonishingly simple bug that results from a common practice. Computer programs frequently operate on chunks of data that are read from a file, from the network, or even from the keyboard. Programs allocate finite-sized blocks of memory, called buffers (see Fig. 7.11), to store these data as they work on them. A buffer overflow happens when more data are written to or read from a buffer than the buffer can hold. Spilled data corrupt adjacent blocks of data, and pretty soon you have a digital stampede in memory that leads to total disarray and data revolt. The core problem is memory address sequencing and execution. When a user program is loaded into memory, the central processing unit assigns a unique location and tag to every byte (data and program), and at the same time it allocates a reserved area called the "stack," which tells the processing unit the sequence of execution and where to return. A stream of bits can be injected into the stack without a valid return address; the processor will then continue to execute instructions outside the buffer, and eventually the synthesis process will crash.
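The spilled-data mechanism can be illustrated with a toy memory model in Python. (Real overflows occur in languages like C that lack bounds checking; the cell layout and values below are purely illustrative.)

```python
# Toy memory: adjacent cells hold a buffer, then a saved return
# address. Writing more than the buffer holds spills into it.
MEM_SIZE, BUF_START, BUF_LEN = 16, 0, 8
RETURN_SLOT = BUF_START + BUF_LEN     # cell right after the buffer

def write_input(memory, data):
    """Unsafe copy: no bounds check against BUF_LEN (the bug)."""
    for i, byte in enumerate(data):
        memory[BUF_START + i] = byte  # may run past the buffer's end
    return memory

memory = [0] * MEM_SIZE
memory[RETURN_SLOT] = 0xAA            # legitimate return address

write_input(memory, [1] * 10)         # 10 bytes into an 8-byte buffer
assert memory[RETURN_SLOT] == 1       # return address overwritten
```

An attacker who controls the spilled bytes controls what the overwritten return address points to, which is exactly how a crash is escalated into code execution.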
FIGURE 7.11 A buffer overflow can trigger a hostile malware payload, which corrupts the execution of the program.
Method 2: launch pad
In this case, the buffer overflow acts as a launch pad for another malware vector. The program executes, but the results will be shocking and detrimental. And here is a scary revelation: there will be some nasty hacking cases where a payload is delivered via a doctored blood sample or even directly from a person's body. One can imagine a person whose DNA is essentially deadly to poorly secured computers. Dr. Lee Organick eloquently commented: "A doctored biological sample could indeed be used as a vector for malicious DNA to get processed downstream after sequencing and be executed." CRISPR (remember: clustered regularly interspaced short palindromic repeats), which is rapidly translating a revolutionary technology into transformative therapies, is a simple yet powerful tool for editing genomes. CRISPR systems are already being used to alleviate genetic disorders in animals and are likely to be employed soon in the clinic to treat human diseases of the eye and blood. It allows researchers to easily alter DNA sequences and modify gene function. CRISPR gene editing technology is explained later in more detail. Its many potential applications include correcting genetic defects, treating and preventing the spread of diseases, and improving crops; however, its promise also raises ethical concerns. This is the point that we are trying to highlight: hackers with evil intentions could use CRISPR's genome editing capability to cut any region of DNA and insert a virus. Researchers have programmed CRISPR molecules to release a fluorescent signal while chopping away at viruses, so that the presence of a virus can be detected.
Sherlock software
A new diagnostic tool for cancer with the potential to detect minuscule concentrations of mutant DNA in a patient's blood has been created by the same team partly credited with developing the CRISPR gene editing technique. SHERLOCK is the name given to the detection software: "Specific High-Sensitivity Enzymatic Reporter unLOCKing" (see Appendix 7.F, Sherlock Detective Software). Dr. Francis Collins of the National Institutes of Health (NIH) and his team were able to detect the presence of viruses even in extremely low concentrations, as low as two molecules in a quintillion. In a separate test, SHERLOCK was able to detect two different strains of the antibiotic-resistant superbug Klebsiella pneumoniae. Then, in June 2017, Dr. Yoon-Seong Kim, an associate professor at the University of Central Florida, and his team reported in the journal Scientific Reports that they had used a CRISPR system to detect the presence of Parkinson's disease. This disorder of the central nervous system causes the malfunction and death of nerve cells in the brain and gets worse over time, causing tremors and problems with movement. The disease affects about 1 million people in the United States, according to the Parkinson's Disease Foundation. The corollary of all this is that CRISPR could be used to inject malware into edited DNA. This is a very serious issue that requires the highest attention; addressing it must be a joint effort between geneticists and computer scientists.
The creative mind of the hacker
DNA synthesis and sequencing code can be vulnerable during the binary transcription process. In the future, there will be some interesting and "creative" DNA hacking that will put many chuckholes in the road to clean DNA storage. Here is an interesting story about how creative the hacking mind can be, something that may be just the tip of the hacking iceberg. In April 2017, the British press
published an interesting story about an ingenious theft in prison. On April 11, 2017, the State of Ohio Office of the Inspector General released a damning report on how five inmates managed to build two computers from assorted parts, connect them to the Internet, download pornography, and create fake credit cards loaded with illicitly obtained tax refunds. Investigators also found bitcoin wallets and articles about making homemade drugs and plastic explosives. You could not make this story up. The complete report of the theft can be downloaded from https://regmedia.co.uk/2017/04/12/ohio_inspector_general_report.pdf.
The next generation of DNA hacking
But the most perplexing phenomenon of the Internet age is the following syndrome: how do the malware marines always run faster than the cyber defense wardens and outsmart them in every round? Well, it is easier to destroy than to build: it took 10,000 workers 5 years to build the World Trade Center, but 19 terrorists brought the two towers down in 2 hours. Let us put physics aside and examine what makes the cortex of the hacker an anatomical marvel. The hacker is an invader, not a crusader: a person capable of commanding and controlling a system. The hacker is an intelligent "lateral" thinker. Like a car mechanic, the hacker has the appropriate tools for the appropriate task, and accumulates new tools from hacking friends. The hacker is an avid reader and excavator, whose cognitive skills provide him or her with a realistic plan before jumping into the fire. Strategic hackers go after vulnerable systems with holes and defects. Hackers are loyal to themselves and have very few friends. Like engineers, they measure more than twice before they cut. Hackers learn from their mistakes and keep their expertise to themselves. Obviously, they use the element of surprise to break through firewalls. Hackers are a copy of the Navy SEALs: very adventurous, and courageous in taking grave risks. Hackers have high-definition vision, predictive analytics, and a phenomenal mental "what if" simulation scenario of the assault. They thrive on success, not failure. Hackers have a great opportunity to become security gurus and draw excellent salaries and recognition. Scientific American, in its November 2017 issue, published an eye-popping article titled "Mail-Order CRISPR Kits Allow Absolutely Anyone to Hack DNA." Within a few years, some smart bioinformaticians will know how to bridge CRISPR from editing DNA genomes to hacking the binary-to-ATCG code.
The big challenge is to know how to stop such malice and create a counter-hacking tool.
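The binary-to-ATCG bridge mentioned above can be illustrated with a minimal sketch. This is a common textbook mapping (2 bits per nucleotide), not the codec of any particular DNA-storage system:

```python
# Minimal 2-bits-per-base mapping (illustrative only; production DNA-storage
# codecs add error correction and avoid long runs of the same base).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bits_to_dna(bits: str) -> str:
    """Encode a bit string (even length) as a strand of A/C/G/T."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(dna: str) -> str:
    """Decode a DNA strand back to the original bit string."""
    return "".join(BASE_TO_BITS[base] for base in dna)
```

For example, `bits_to_dna("01101100")` yields `"CGTA"`, and `dna_to_bits` inverts it exactly, which is why tampering with either representation corrupts the other.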
CRISPR, the gene editor

Viruses are a common threat to cellular life, including the bacteria that constitute the majority of life on Earth. Consequently, a variety of mechanisms to eradicate virus infection have evolved. A relatively recent discovery is the adaptive immune system called CRISPR-Cas, which provides specific adaptive immunity by integrating short virus sequences (think of them as vaccines) into the cell's genome, allowing the cell to remember, recognize, and clear infections. "CRISPR" stands for clustered regularly interspaced short palindromic repeats. It is a simple and powerful tool for editing genomes (refer to Fig. 7.12). It allows researchers, geneticists, and even DNA terrorists to easily alter DNA sequences and modify gene function. CRISPR is a specialized stretch of DNA. The protein Cas9 ("CRISPR-associated protein 9") is an enzyme that acts like a pair of scissors, capable of cutting defective strands of DNA, treating and preventing the spread of diseases, and improving crops. However, its promise also raises ethical concerns.
Chapter 7 Synthesizing DNA-encoded data
FIGURE 7.12 Genome hacking by CRISPR-Cas9. The strategy of genome hacking is the cleavage of double-stranded DNA at a selected position on the genome. The CRISPR RNA (crRNA), having a sequence homologous to the target site, and the trans-activating CRISPR RNA (tracrRNA) are enough to bring the Cas9 nuclease to the target site. Once the Cas9-sgRNA complex (sgRNA: a single guide RNA combining crRNA and tracrRNA) cleaves the target gene, it is easy to disrupt the function of the gene by a deletion or insertion mutation. This overwhelmingly simple method is now rapidly spreading as a practical genome hacking technique.
Yes indeed, it has the potential to compromise the binary-to-DNA storage synthesis process. CRISPR is like a vaccine that contains diluted pathogens for the purpose of inoculation: it is made of a sequence of DNA that could be coded with evasive malware, such as DeepLocker, to harm DNA computers.
Who is the DNA cyber hacker?

No one has ever described exactly who the DNA cybercriminal is or what the qualifications are to become one. Hackers once tinkered with AT&T's phone system (the followers of Captain Crunch), making millions of free calls in the 1960s and 1970s; then they started using modems and hacking into mainframes in the 1970s and 1980s; then they exploited LAN technology and the burgeoning Internet in the 1980s. Malware writers moved from boot-sector viruses on floppy disks to file-infector viruses and then to macro viruses in the 1990s, vigorously exploiting worms and Trojans for attacks. Then, in the mid-2000s, artificial intelligence (AI) and molecular nanotechnology prevailed, and systems became "driverless," autonomic self-learners. The shift to machine learning and AI is the next major progression in IT. Cybercriminals are also studying AI to use it to their advantage, and to weaponize it. The use of AI will definitely be a game changer in cyberattacks. The characteristics of AI-powered attacks will also be different, and our defense against them will require a new strategy symmetrical with the attacks. Cyber hackers switched gears and started to build AI malware that outsmarts enterprise systems. In fact, AI became weaponized with highly sophisticated machine learning features and asymmetrically surpassed banking systems and social media systems. Google, Facebook, Twitter, and Instagram lost millions of
user records. Foreign governments have realized the might of AI, have their fingers in the political pie, and have started meddling in the elections of their adversaries. Of course, they deny their mischief and instead attend the funerals of their victims. Privacy has jumped to the top of the security list. Privacy is the attribute that keeps something completely invisible to the public, preferably in dormant mode, until it is retrieved with the proper method by the rightful owner. Modern cybersecurity engineering will be heavily immersed in AI and nanotechnology tools. Let us talk about AI first.
Type 1: artificial intelligence-powered malware

DeepLocker, the smart software. AI-powered crime-as-a-service has facilitated massive data larcenies, and pretty soon DNA binary storage will be the new target. Encryption by itself is not cognitive; it carries no intelligence. IBM has come up with an interesting product that can be used to inject a malware payload into a host application, and the same product can be used to detect and alert the user about a malware vector residing in his or her host application. The name of this magical smart code is DeepLocker (see Appendix 7.B, IBM's DeepLocker, the smart software). It is the new AI-powered generation of malware: highly targeted and evasive. DeepLocker's deep neural network (DNN) model stipulates "trigger conditions" that must be met to fire the payload, as shown in Fig. 7.13 (see Appendix 7.E, Deep neural network). If these conditions are not met, and the target is not found, the malware remains locked up, which IBM says makes the malicious code "almost impossible to reverse engineer." To demonstrate DeepLocker's potential, security researchers created a trial test in which the WannaCry ransomware was hidden in a video conferencing application. The malware was not detected by antivirus engines or sandboxing.
FIGURE 7.13 DeepLocker is an AI-powered smart machine. It has three layers of attack concealment. The first concealment trigger is to hide the target. The second concealment trigger is to hide the time of the attack. The third concealment trigger is to hide the nature of the payload. DeepLocker is triggered by people's faces or other visual cues. AI, artificial intelligence. MERIT CyberSecurity Consulting DNA Engineering.
This is not the first time IBM has presented research about the perils of artificial intelligence. At the RSA Conference in San Francisco on April 16, 2018, IBM outlined ways that an attacker could manipulate machine learning models to corrupt results and influence outcomes. DeepLocker could be embedded into a legitimate application that is widely distributed, according to Dr. Marc Stoecklin, principal research scientist and manager, Cognitive Cybersecurity Intelligence, IBM Research. He described the strategy of DeepLocker: "The malware only deploys when certain conditions are met, such as being installed on a particular device or even when a specific end user logs in. The AI component keeps the malware hidden and is used to understand when the benign application is deployed on the right target ... Basically, we can train the AI component to recognize a specific person, a specific victim or target, and only when that person is sitting in front of a computer and can be recognized via the webcam, then a software 'trigger' will allow the malware unit to unlock the malicious behavior." The trigger can be anything: facial recognition, behavioral biometrics, or the presence of an application on the system, to help target a specific group or company. Stoecklin said that DeepLocker is entirely enclosed within the host application and does not need to connect to the Internet to deliver its malware payload. Remember Stuxnet? In the first generation of malware, we had the successful Trojan horse attacks; a Trojan is a type of malware that is often disguised as legitimate software. Once the Trojan is inside the victim system (in memory), it sends fake messages to distract the administrator and give the backdoor virus enough time to sneak into the system. The Trojan opens TCP port 31337 to allow the client version of the Back Orifice (BO) malware to scoot into the target system. The server section of the BO is located at a remote site.
The BO's entry into the system is completely stealthy and does not show up in the task list of the victim system. The asymmetric BO rootkit is then ready to do its evil work: sniff passwords, record keystrokes, and access the desktop's file system, all while remaining undetected. Users of the victim system are typically tricked by some form of social engineering into loading and executing Trojans on their system. One way to detect DeepLocker in the victim system is for the host application to have higher intelligence than DeepLocker; that will require another AI-powered software package to beat AI-powered malware. DNA code is not immune to AI-powered malware such as DeepLocker. In the next few years, synthesis and sequencing machines will be equipped with AI-powered SmartWare to protect the integrity of binary and DNA coding.
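The "locked payload" idea Stoecklin describes is, at its core, environmental key derivation: the decryption key is never stored in the malware but is derived from an attribute of the intended target, so an analyst who lacks that attribute cannot recover the payload. The toy sketch below is a defensive illustration only; the attribute names and the XOR "cipher" are simplifications, not IBM's implementation:

```python
import hashlib

def derive_key(target_attribute: bytes) -> bytes:
    # The key exists only as a hash of an observed attribute
    # (in the real concept, e.g., a face embedding or device ID).
    return hashlib.sha256(target_attribute).digest()

def xor_stream(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher, kept minimal for illustration.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

message = b"benign demo payload"
locked = xor_stream(message, derive_key(b"expected-target-id"))

# Only the matching attribute regenerates the key; any other guess
# yields garbage, which is why reverse engineering is so hard.
unlocked = xor_stream(locked, derive_key(b"expected-target-id"))
```

Because the key is a one-way function of the target attribute, inspecting the locked bytes tells the defender nothing about either the payload or the intended victim.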
Type 2: nanopowered malware

DNA malware-as-a-service

The evolution of nanotechnology has been described as the next Industrial Revolution, ushering in radical new innovations in science, technology, and lifestyle. It has the potential to impact every aspect of our lives. Nanotechnology is science, engineering, and technology conducted at the nanoscale, about 1 to 100 nm. It is, in fact, the molecular juggernaut that will proliferate into several fields, such as medicine, engineering, manufacturing, and DNA hacking. In 2000, Bill Joy, cofounder of Sun Microsystems, wrote an article describing self-replicating nanobots that over time destroy the human race, reducing the world to a mass called "gray goo." Michael Crichton also wrote a novel, Prey, in which nanoparticles cause mass hysteria. Many fear that nanotechnology will develop too rapidly and that proper safeguards will not be in place to protect the public. Some would like
nanotechnology production stopped so it can be properly studied and safety regulations and guidelines developed. That is not going to happen. Nanotechnology and DNA are intersecting, and a new field has emerged: DNA nanotechnology. It is defined as the design and manufacture of artificial nucleic acid structures (DNA building blocks) as nonbiological engineering materials for nanotechnology, rather than as carriers of genetic information in living cells. Nanotechnology has made striking advances in medicine, biology, bioengineering, and electronics. Now it is time for DNA nanotechnology to step into the cyber malware domain. We are going to have a new breed of criminals, called nanohackers, who will acquire experience with nanotechnology and turn around and use it in evil ways. These criminals will learn how to counterattack digital immunity and how to use DNA malware to disrupt the peaceful lives of smart city citizens. DNA hackers have been looking for the right opportunity to create their own DNA nanomalware; the cyberterrorism world has moved to greener pastures. It is not difficult for biohackers to design malware vectors from AGCT base-pair sequences and offer them as DNA malware-as-a-service. You would be surprised how many terrorist organizations would be interested in using such a service for DNA attacks. Nanorobots can be custom-designed for attacks at any scale, and they are an attractive product for marketing malware payload implants. Nanorobots are discussed in the next section.
DNA hacking with nanorobots

A nanorobot is an autonomous, preprogrammed structure at the atomic scale; it belongs to the engineering discipline of nanorobotics. It is a machine that can build and manipulate things precisely: a robot that can arrange atoms like Lego pieces and build any structure from basic atomic elements such as C, N, H, O, P, Fe, and Ni. In fact, we should realize that each of us is alive today because of billions of nanobots operating inside our trillions of cells. We give them biological names like "ribosome," but they are essentially machines programmed with a function like "read a DNA sequence and create a specific protein." The DNA processing pipeline begins with DNA strands in a test tube, so we start our security exploration from this point. The first option is to evaluate whether we can compromise a computer program using physical DNA. Nanorobots are activated by logical AND gates (circuits that produce an output signal only when signals are received simultaneously through all input connections), which govern when their functions engage. A nanorobot is any active structure capable of the following functions at the nanoscale: actuation, sensing, manipulation, propulsion, signaling, and information processing. Most importantly, nanorobots can participate in a variety of DNA hacking mischief. This can happen during the binary synthesis stage by merging the payload with the code of the target program. The payload, after its entry into the host strand, opens a socket for remote control, much like conventional hacking. Moreover, the payload starts producing copies of itself to replace worn-out units, a process called self-replication.
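The AND-gate activation just described can be modeled in a few lines. The trigger names below are hypothetical placeholders for the molecular inputs a real DNA nanorobot would sense:

```python
# Toy model of AND-gate actuation: the robot acts only when ALL required
# trigger inputs are detected at the same time.
REQUIRED_TRIGGERS = {"trigger_strand_A", "trigger_strand_B"}

def should_actuate(detected_signals) -> bool:
    """Return True only when every required trigger is simultaneously present."""
    return REQUIRED_TRIGGERS.issubset(set(detected_signals))
```

With only one trigger strand present, `should_actuate` stays False; the gate fires only when both inputs arrive together, which is exactly the conditional behavior that makes such devices useful for targeted payload delivery.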
How could DNA attack a computer?

The first step in setting up a DNA attack is to synthesize a real biological DNA sequence embedded with a malicious exploit. The second step is to load the exploit into the target using the normal sequence entry method. The third step is to use a DNA sequence analyzer to evaluate the impact of the exploit.
A FASTQ compression utility program was used as the exploit tool; FASTQ tools are designed to compress DNA sequence reads (see Fig. 7.14 for a sample Python program, and Appendix 7.C later in this chapter for more on FASTQ). In the next decade, the world will face the same dilemma it faces today: nanotechnology will cross into cybercrime and terrorism and get into the hands of cybercriminals.
FIGURE 7.14 The FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Hackers could use FASTX to insert an exploit into the short reads and trip up the whole sequencing process. MERIT CyberSecurity Consulting is collaborating with the University of Dubai to develop a prototype of DNA storage.
Appendices

Appendix 7.A Bioinformatics

Bioinformatics is a hybrid interdisciplinary science that develops methods and software tools for understanding biological data. It is a huge umbrella that combines biology, computer science, information engineering, programming languages, mathematics, and statistics to analyze and interpret biological data. Bioinformatics has a collection of public knowledge bases that collect DNA and RNA sequences from scientific papers and genome projects run by international consortia, such as the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL-Bank) in the
United Kingdom, the DNA Data Bank of Japan (DDBJ), and GenBank of the National Center for Biotechnology Information (NCBI) in the United States; together these form the International Nucleotide Sequence Database Collaboration (INSDC). Bioinformatics also has genome browsers: repositories that store genomic and molecular information about a particular species. Also under the domain of bioinformatics are knowledge engines worth mentioning: the worldwide Protein Data Bank (wwPDB), a joint effort of the Research Collaboratory for Structural Bioinformatics (RCSB) in the United States, the Protein Data Bank in Europe (PDBe) at the European Bioinformatics Institute in the United Kingdom, and the Protein Data Bank Japan at Osaka University. The homepages of these institutions include links to retrieve pertinent data files, expositories, and tutorial material, as well as specialized search engines for retrieving specific structures. Information retrieval from the data archives uses standard tools for identifying data items by keyword; for instance, one can type "aardvark myoglobin" into Google and retrieve the molecule's amino acid sequence. Other algorithms search data banks to detect similarities between data items. For example, a standard problem is to probe a sequence database with a gene or protein sequence of interest to detect entries with similar sequences.
Goals of bioinformatics

Bioinformatics includes a family of algorithms to measure sequence similarity. The Needleman-Wunsch algorithm, which is based on dynamic programming, guarantees finding the optimal alignment of a pair of sequences. This algorithm essentially divides a large problem (the full sequence) into a series of smaller problems (short sequence segments) and uses the solutions of the smaller problems to construct a solution to the large problem. Similarities between sequences are scored in a matrix, and the algorithm allows for the detection of gaps in the alignment. Another well-known program is BLAST, a search tool that enables a researcher to compare a query sequence with a library or database of sequences and identify library sequences that resemble the query above a certain threshold; we discussed it in detail earlier in this chapter. Another goal of bioinformatics is the prediction of protein structure from an amino acid sequence; the spontaneous folding of proteins shows that this should be possible. Progress in the development of methods to predict protein folding is measured by the biennial Critical Assessment of Structure Prediction (CASP) experiments, which involve blind tests of structure prediction methods. Bioinformatics is also used to predict interactions between proteins, given the individual structures of the partners. Computer programs simulate these interactions to predict the optimal spatial relationship between binding partners. A particular challenge, one that could have important therapeutic applications, is to design an antibody that binds with high affinity to a target protein. In the beginning, bioinformatics research had a relatively narrow focus, concentrating on the development of algorithms to analyze particular gene sequences or protein structures.
But bioinformatics now takes a broader view, analyzing the relationships among different types of data to understand natural phenomena, including organisms and disease.
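The dynamic-programming idea behind Needleman-Wunsch can be sketched compactly. This version uses a simple linear gap penalty and a flat match/mismatch score; real aligners use substitution matrices and affine gap penalties:

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # Fill the dynamic-programming score matrix.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), S[n][m]
```

For example, aligning "GATT" against "GAT" yields an alignment with three matches and one gap (score 2 under these parameters), exactly the "divide into smaller subproblems" behavior described above.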
Appendix 7.B IBM's DeepLocker, the AI-powered malware

DeepLocker was developed by IBM Research to demonstrate how several existing AI and malware software modules (like Lego components) could be welded together to create a highly evasive
new breed of malware, one that hides its malicious attack until it reaches its specific target. DeepLocker uses a DNN AI model to hide its payload in benign host applications; the payload is unlocked if, and only if, the intended target is reached. DeepLocker uses several attributes for target identification, including visual, audio, geolocation, and system-level features. In contrast to existing evasive and targeted malware, this method makes it extremely challenging to reverse-engineer the benign host software and recover the mission-critical secrets, including the logic of the payload and the target specifics. The IBM researchers performed a live demonstration of a proof-of-concept implementation of DeepLocker, camouflaging well-known ransomware in a benign application such that it remained undetected by malware analysis tools, including antivirus engines and malware sandboxes. They discussed the technical details, implications, and use cases of DeepLocker and, more importantly, shared countermeasures that could help defend against this type of attack in the wild.
Appendix 7.C FASTQ software

FASTQ was developed at the Wellcome Sanger Institute (WSI), a nonprofit British genomics and genetics research institute, primarily funded by the Wellcome Trust. It is located on the Wellcome Genome Campus near the village of Hinxton, outside Cambridge, and shares this location with the European Bioinformatics Institute. WSI was established in 1992 and named after the double Nobel laureate Frederick Sanger. It was conceived as a large-scale DNA sequencing center to participate in the Human Genome Project and went on to make the largest single contribution to the gold-standard sequence of the human genome. From its inception, the institute has maintained a policy of data sharing and does much of its research in collaboration. FASTQ has emerged as a common file format for sharing sequencing read data, combining both the sequence and an associated per-base quality score, despite lacking any formal definition to date and existing in three incompatible variants. The original Sanger FASTQ format is the one agreed upon by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava, and EMBOSS, and it serves in practice as the reference for this important file format. In the area of DNA sequencing, the FASTQ file format has emerged as a de facto common format for data exchange between tools. It provides a simple extension to the FASTA format: the ability to store a numeric quality score associated with each nucleotide in a sequence. No doubt because of its simplicity, the FASTQ format has become widely used as a simple interchange file format. Unfortunately, history has repeated itself, and the FASTQ format suffers from the absence of a clear definition and several incompatible variants.
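A Sanger-format FASTQ record is four lines: an "@" header, the sequence, a "+" separator, and a quality string in which each character encodes a Phred score as its ASCII value minus 33. A minimal parser makes the format concrete:

```python
def parse_fastq(text: str):
    """Parse Sanger-format FASTQ; yields (read_id, sequence, phred_scores)."""
    lines = [ln.rstrip("\n") for ln in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith("@") and sep.startswith("+"), "malformed record"
        # Sanger/Phred+33 encoding: quality = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores

record = next(parse_fastq("@read1\nGATTACA\n+\nIIIIIII\n"))
```

Here the quality character "I" (ASCII 73) decodes to Phred quality 40 at every base. The Solexa and early Illumina variants mentioned above use different offsets, which is exactly why the absence of a single definition causes interoperability problems.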
Appendix 7.D The Bachand DNA story

Marlene and George Bachand, pioneers of DNA-encrypted data (Fig. 7.15)

Using a practically unbreakable encryption key, the husband-and-wife team encoded an abridged version of a historical letter written by President Harry Truman into DNA. They then made the DNA, spotted it
FIGURE 7.15 Marlene and George Bachand, two biological engineers at Sandia, stated, "We are taking advantage of a biological component, DNA, and using its unique ability to encode huge amounts of data in an extremely small volume to develop DNA constructs that can be used to transmit and store vast amounts of encrypted data for security purposes." George is holding a large book and an Iomega storage disk, and Marlene is showing that all these data can be stored in one small tube of DNA. Courtesy of https://share-ng.sandia.gov/news/resources/news_releases/dna_storage/.
onto Sandia letterhead, and mailed it, along with a conventional letter, around the country. After the letter's cross-country trip, the Bachands were able to extract the DNA from the paper, amplify and sequence it, and decode the message, all in about 24 hours and at a cost of about $45. Encryption can be added to the translation of binary to DNA code; it is a viable precaution against DNA hacking, as shown in Fig. 7.16.
FIGURE 7.16 The process of DNA encryption: the message is encrypted, synthesized as DNA, then sequenced and decoded back to text, and finally returned to binary. Encrypting DNA code makes it doubly difficult for hackers to break, but hackers can also use DNA to create very efficient malware targeting the medical community.
To achieve this proof of principle, the first step was to develop the software to generate the encryption key and encrypt text into a DNA sequence. DNA is made up of four different bases, commonly referred to by their one-letter abbreviations: A, G, C, and T. Using a three-base code, exactly how living organisms store their information, 64 distinct characters (letters, spaces, and punctuation) can be encoded, with room for redundancy. Using a computer algorithm, the team can encrypt a message into a sequence of DNA and then chemically synthesize the DNA. The DNA can be read by DNA sequencing and then translated and decoded using the same computer algorithm (credit: Sandia National Laboratories). The team's first test was to encode a 180-character message, about the size of a tweet. Encoding the message into 550 bases was easy; actually making the DNA was hard. George Bachand said, "There's a new technology that's come out and made the ability to take synthetic DNA, what are called gene blocks, and stitch them together into these artificial chromosomes. Now it is possible to readily make these gene blocks right on the bench top and it can be done in large, production-scale pretty quickly." Since successfully encoding, making, reading, and decoding the 180-character message and the 700-character Truman letter, the Bachands have been working on even longer test sequences. What the Bachands really want to apply their technique to, however, is national security problems. "We have achieved the proof-of-principle. Now the big challenge for us is identifying the potential applications." Two possible applications the team has identified are storing historical classified documents and barcoding/watermarking electromechanical components for Sandia's Department of Defense-certified fabrication facility.
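The three-base scheme described above is easy to sketch: 4^3 = 64 codons, enough for a 64-character alphabet. The character table below is a hypothetical stand-in; the Bachands' actual mapping (and their encryption layer) is not public:

```python
from itertools import product
import string

# Hypothetical 64-character alphabet: 26 + 26 + 10 + space + period = 64.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + " ."
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # 4^3 = 64 codons

CHAR_TO_CODON = dict(zip(ALPHABET, CODONS))
CODON_TO_CHAR = dict(zip(CODONS, ALPHABET))

def encode(text: str) -> str:
    """Map each character to a three-base codon."""
    return "".join(CHAR_TO_CODON[ch] for ch in text)

def decode(dna: str) -> str:
    """Read the strand three bases at a time and map back to characters."""
    return "".join(CODON_TO_CHAR[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

At three bases per character, a 180-character message comes to 540 bases; the 550 bases reported above presumably reflect the scheme's room for redundancy.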
George Bachand imagines encoding each component's history (when it was manufactured, the lot number, the starting material, even the results of reliability tests) into DNA and spotting it onto the actual chip. To test the feasibility, Marlene Bachand spotted lab equipment with a test message and was able to recover and decode the message, even after months of daily use and routine cleaning. DNA spotted onto electronic components and stored in cool, dark environments could remain recoverable for hundreds of years. Another application for the Bachands' DNA storage method would be historical or rarely accessed classified documents. DNA requires much less maintenance than disk- or tape-based storage and does not need lots of electricity, like cloud-based storage, or tons of space, like paper-based storage. But conversion of paper documents into DNA requires the "cumbersome" process of scanning, encrypting, and then synthesizing the DNA, admitted George Bachand. Making the DNA is the most expensive part of the process, but the cost has decreased substantially over the past few years and should continue to drop. Marlene Bachand commented, "I hope this project progresses and expands the biological scope and nature of projects here at Sandia. I believe the field of biomimicry has no boundaries. Given all of the issues with broken encryption and data breaches, this technology could potentially provide a path to address these timely and ever-increasing security problems."
Appendix 7.E Deep neural network

It has long been a common belief that simple models (simulation, mathematical, etc.) provide higher interpretability than complex ones; straightforward linear models and basic decision trees still dominate many applications for this reason. This belief is, however, challenged by recent work in which carefully designed interpretation techniques have shed light on some of the most complex and deepest machine learning models. Fig. 7.17 shows the relationship of AI to its newer derivative technologies.
FIGURE 7.17 The hierarchical relationship between AI and deep learning. A model of a deep neural network with multiple hidden layers, which will lead to more accurate results. © 2014 MERIT CyberSecurity Group; all rights reserved.
A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment; other neurons get activated through weighted connections from previously active neurons. Some neurons may influence the environment by triggering actions (see Figs. 7.18 and 7.19).
FIGURE 7.18 A deep neural network consists of three types of layers: the input layer, the hidden layers, and the output layer. The first layer is the input layer, which receives all the inputs, and the last layer is the output layer, which provides the desired output. All the layers in between are called hidden layers. There can be any number (n) of hidden layers, thanks to the high-end computing resources available these days. The number of hidden layers and the number of perceptrons in each layer depend entirely on the use case you are trying to solve.
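The layer arithmetic behind Fig. 7.18 can be written out directly. This is a generic fully connected forward pass with made-up weights, not tied to any particular framework:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    # Each output node j computes f(sum_i w[j][i] * x[i] + b[j]).
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Toy 2-input, 3-hidden, 1-output network with arbitrary weights.
x = [0.5, -1.0]
hidden = dense_layer(x, [[0.1, 0.4], [-0.2, 0.3], [0.5, 0.5]], [0.0, 0.1, -0.1])
output = dense_layer(hidden, [[0.3, -0.6, 0.9]], [0.05])
```

Stacking more `dense_layer` calls between `x` and `output` is exactly what "adding hidden layers" means; training then consists of adjusting the weight and bias values rather than the structure of this computation.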
FIGURE 7.19 The computational neural network is a remarkable replication of the human neural network; it is one of the most revealing inventions that humans have copied from their own brains.
Appendix 7.F Sherlock detective software Sherlock’s Toolkit, developed by MIT Lincoln Laboratory, is an open-source, scalable system that comes with two powerful modules: first, short tandem repeat (STR), to check the number of sequence repeats in DNA; second, single-nucleotide polymorphism (SNP), to check the position of DNA’s ladder step in the genome. The toolkit includes modules for a range of kinship, biogeographic ancestry, replicate, and mixture analyses.
Appendix 7.G Glossary (deep neural networks)

The intent of this glossary is to provide clear definitions of the technical terms specific to deep artificial neural networks. It is a work in progress.

Activation: An activation, or activation function, for a neural network is defined as the mapping of the input to the output via a nonlinear transform function at each "node," which is simply a locus of computation within the net. Each layer in a neural net consists of many nodes, and the number of nodes in a layer is known as its width. Activation functions are the gates that determine, at each node in the net, whether and to what extent to transmit the signal the node has received from the previous layer. A combination of weights (coefficients) and biases works on the input data from the previous layer to determine whether that signal surpasses a given threshold and is deemed significant. Those weights and biases are slowly updated as the neural net minimizes its error; i.e., the nodes' activation levels change in the course of learning. Deeplearning4j includes activation functions such as sigmoid, relu, tanh, and ELU. These activation functions allow neural networks to make complex boundary decisions for features at various levels of abstraction.
Adadelta: An updater, or learning algorithm, related to gradient descent. Unlike SGD, which applies the same learning rate to all parameters of the network, Adadelta adapts the learning rate per parameter.

Adagrad: Short for adaptive gradient, Adagrad is an updater or learning algorithm that adjusts the learning rate for each parameter in the net by monitoring the squared gradients in the course of learning. It is a substitute for SGD and can be useful when processing sparse data.

Adam: Adam (Gibson) cocreated Deeplearning4j. Adam is also an updater, similar to RMSProp, which uses a running average of the gradient's first and second moments plus a bias-correction term.

Affine layer: Affine is a fancy word for a fully connected layer in a neural network. "Fully connected" means that all the nodes of one layer connect to all the nodes of the subsequent layer. A restricted Boltzmann machine, for example, is a fully connected layer. Convolutional networks use affine layers interspersed with both their namesake convolutional layers (which create feature maps based on convolutions) and downsampling layers, which throw out a lot of data and keep only the maximum value. "Affine" derives from the Latin affinis, which means bordering or connected with. Each connection in an affine layer is a passage whereby an input is multiplied by a weight and added to a bias before it accumulates with all other inputs at a given node, the sum of which is then passed through an activation function: e.g., output = activation(weight * input + bias), or y = f(w * x + b).

AlexNet: A deep convolutional network named after Alex Krizhevsky, a former student of Geoff Hinton's at the University of Toronto, now at Google. AlexNet was used to win ILSVRC 2012 and foretold a wave of deep convolutional networks that would set new records in image recognition. AlexNet is now a standard architecture.
It contains five convolutional layers, three of which are followed by max-pooling (downsampling) layers, and two fully connected (affine) layers, all of which end in a softmax layer.

Attention models: Attention models "attend" to specific parts of an image in sequence, one after another. By relying on a sequence of glances, they capture visual structure, much like the human eye is believed to function with foveation. This visual processing, which relies on a recurrent network to process sequential data, can be contrasted with other machine vision techniques that process a whole image in a single, forward pass.

Autoencoder: Autoencoders are at the heart of representation learning. They encode input, usually by compressing large vectors into smaller vectors that capture their most significant features; that is, they are useful for data compression (dimensionality reduction) as well as data reconstruction for unsupervised learning. A restricted Boltzmann machine (RBM) is a type of autoencoder, and in fact, autoencoders come in many flavors, including variational autoencoders, denoising autoencoders, and sequence autoencoders. Variational autoencoders have replaced RBMs in many labs because they produce more stable results. Denoising autoencoders provide a form of regularization by introducing Gaussian noise into the input, which the network learns to ignore in search of the true signal.

Backpropagation: To calculate the gradient that relates weights to error, we use a technique known as backpropagation, which is also referred to as the backward pass of the network. Backpropagation is a repeated application of the chain rule of calculus for partial derivatives.
The first step is to calculate the derivatives of the objective function with respect to the output units, then the derivative of the last hidden layer's output with respect to its input, and then the derivative of that input with respect to the weights between the last hidden layer and the penultimate hidden layer.
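The chain-rule steps above can be sketched on a toy network with one hidden unit, a tanh activation, and a squared-error objective. All names and values here are illustrative, and the hand-derived gradients are checked against a numerical finite difference.

```python
import math

# Toy net: x -> z = w1*x + b1 -> h = tanh(z) -> y = w2*h + b2.
def forward(x, w1, b1, w2, b2):
    z = w1 * x + b1
    h = math.tanh(z)
    y = w2 * h + b2
    return z, h, y

def backprop(x, t, w1, b1, w2, b2):
    """Gradients of the squared error (y - t)^2 via the chain rule."""
    z, h, y = forward(x, w1, b1, w2, b2)
    # Step 1: derivative of the objective with respect to the output.
    dL_dy = 2.0 * (y - t)
    # Step 2: through the output layer to its weight.
    dL_dw2 = dL_dy * h
    # Step 3: back through the hidden layer's nonlinearity...
    dL_dz = dL_dy * w2 * (1.0 - math.tanh(z) ** 2)
    # ...to the weight between input and hidden layer.
    dL_dw1 = dL_dz * x
    return dL_dw1, dL_dw2
```

A gradient-descent step would then nudge each weight in the direction of the negative gradient, e.g. `w1 -= learning_rate * dL_dw1`.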
208
Chapter 7 Synthesizing DNA-encoded data
A special form of backpropagation is called backpropagation through time, or BPTT, which is specifically useful for recurrent networks analyzing text and time series. With BPTT, each time step of the recurrent neural network is the equivalent of a layer in a feed-forward network. To backpropagate over many time steps, BPTT can be truncated for the purpose of efficiency. Truncated BPTT limits the time steps over which error is propagated.

Batch normalization: Batch normalization does what it says: it normalizes minibatches as they are fed into a neural net layer. Batch normalization has two potential benefits: it can accelerate learning, because it allows you to employ higher learning rates, and it also regularizes learning.

Bidirectional recurrent neural networks (RNNs): A bidirectional RNN is composed of two RNNs that process data in opposite directions. One reads a given sequence from start to finish; the other reads it from finish to start. Bidirectional RNNs are employed in NLP for translation problems, among other use cases. Deeplearning4j provides an implementation of bidirectional Graves LSTMs.

Binarization: The process of transforming data into a set of 0s and 1s. An example would be grayscaling an image by transforming a picture from the 0–255 range to a 0–1 range.

Boltzmann machine: A Boltzmann machine learns internal (not defined by the user) concepts that help to explain (that can generate) the observed data. These concepts are captured by random variables (called hidden units) that have a joint distribution (statistical dependencies) among themselves and with the data and that allow the learner to capture highly nonlinear and complex interactions between the parts (observed random variables) of any observed example (like the pixels in an image). You can also think of these higher-level factors or hidden units as another, more abstract, representation of the data.
The Boltzmann machine is parametrized through simple two-way interactions between every pair of random variables involved (the observed ones as well as the hidden ones).

Channel: A word used when speaking of convolutional networks. ConvNets treat color images as volumes; that is, an image has height, width, and depth. The depth is the number of channels, which coincides with how you encode colors. RGB images have three channels, for red, green, and blue, respectively.

Class: Used in classification, a class refers to a label applied to a group of records sharing similar characteristics.

Confusion matrix: Also known as an error matrix or contingency table. Confusion matrices allow you to see whether your algorithm is systematically confusing two labels, by contrasting your net's predictions against a benchmark.

Contrastive divergence: Contrastive divergence is a recipe for training undirected graphical models (a class of probabilistic models used in machine learning). It relies on an approximation of the gradient (a good direction of change for the parameters) of the log-likelihood (the basic criterion that most probabilistic learning algorithms try to optimize) based on a short Markov chain (a way to sample from probabilistic models) started at the last example seen. It has been popularized in the context of restricted Boltzmann machines, the latter being the first and most popular building block for deep learning algorithms; see https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine.

Convolutional network (CNN): A deep neural network that is currently the state of the art in image processing. CNNs set new records in accuracy every year on widely accepted benchmark contests like ImageNet. From the Latin convolvere, "to convolve" means to roll together. For mathematical purposes, a convolution is the integral measuring how much two functions overlap as one passes over the other.
Think of a convolution as a way of mixing two functions by multiplying them: a fancy form of multiplication.
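The "rolling together" of two functions can be made concrete for discrete sequences. The sketch below (`convolve1d` is an illustrative helper, not library code) slides one sequence over the other and sums the products of their overlap at each offset.

```python
def convolve1d(f, g):
    """Full discrete convolution: (f * g)[n] = sum_k f[k] * g[n - k]."""
    n_out = len(f) + len(g) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        for k in range(len(f)):
            j = n - k
            if 0 <= j < len(g):
                # Accumulate the product where the two sequences overlap.
                out[n] += f[k] * g[j]
    return out
```

Convolving with the identity kernel `[1]` returns the input unchanged; a kernel such as `[0, 1, 0.5]` mixes each value with half of its neighbor, which is the one-dimensional analogue of the feature-map filters a CNN learns.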
Imagine a tall, narrow bell curve standing in the middle of a graph. The integral is the area under that curve. Imagine near it a second bell curve that is shorter and wider, drifting slowly from the left side of the graph to the right. The product of those two functions' overlap at each point along the x-axis is their convolution. So, in a sense, the two functions are being "rolled together."

Cosine similarity: It turns out two vectors are just two of the three sides of a triangle, so let us do a quick trig review. Trigonometric functions such as sine, cosine, and tangent are ratios that use the lengths of the sides of a right triangle (opposite, adjacent, and hypotenuse) to compute the shape's angles. We can also know the angles at which those sides intersect. Differences between word vectors, as they swing around the origin like the arms of a clock, can be thought of as differences in degrees. And similar to ancient navigators gauging the stars by a sextant, we will measure the angular distance between words using something called cosine similarity. You can think of words as points of light in a dark canopy, clustered together in constellations of meaning. To find that distance knowing only the word vectors, we need the equation for the vector dot product (multiplying two vectors to produce a single, scalar value). Cosine is computed from the angle at the origin, which makes it useful here. (We normalize the measurements, so they come out as percentages, where 1 means that two vectors are equal and 0 means they are perpendicular, bearing no relation to each other.)

Data parallelism and model parallelism: Training a neural network on a very large dataset requires some form of parallelism, of which there are two types: data parallelism and model parallelism. Let us say you have a very large image dataset of 1,000,000 faces.
Those faces can be divided into batches of 10, and then 10 separate batches can be dispatched simultaneously to 10 different convolutional networks, so that 100 instances can be processed at once. The 10 different CNNs would then each train on a batch, calculate the error on that batch, and update their parameters based on that error. Then, using parameter averaging, the 10 CNNs would update a central, master CNN that takes the average of their updated parameters. This process would repeat until the entire dataset has been exhausted.

Data science: The discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference.

Deep belief network (DBN): A stack of restricted Boltzmann machines, each of which is itself a feed-forward autoencoder that learns to reconstruct input layer by layer, greedily. Pioneered by Geoff Hinton and crew. Because a DBN is deep, it learns a hierarchical representation of input. Because DBNs learn to reconstruct that data, they can be useful in unsupervised learning.

Deep learning: Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large datasets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech, and audio, while recurrent nets have shed light on sequential data such as text and speech.
Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection
or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.

Distributed representations: The Nupic community has published a good explanation of distributed representations. Another good explanation can be found on Quora: https://www.quora.com/Deep-Learning-What-is-meant-by-a-distributed-representation.

Downpour stochastic gradient descent: An asynchronous stochastic gradient descent procedure, employed by Google, among others, that expands the scale and increases the speed of training deep learning networks.

DropConnect: A generalization of dropout for regularizing large fully connected layers within neural networks. Dropout sets a randomly selected subset of activations to 0 at each layer. DropConnect, in contrast, sets a randomly selected subset of weights within the network to 0. See regularization of neural networks using DropConnect at https://cds.nyu.edu/wp-content/uploads/2014/04/dropc_slides.pdf.

Dropout: A hyperparameter used for regularization in neural networks. Like all regularization techniques, its purpose is to prevent overfitting. Dropout randomly makes nodes in the neural network "drop out" by setting them to 0, which encourages the network to rely on other features that act as signals. That, in turn, creates more generalizable representations of data. In short, dropout is a simple way to prevent neural networks from overfitting. See also https://en.wikipedia.org/wiki/Recurrent_neural_network.

Embedding: An embedding is a representation of input, or an encoding. For example, a neural word embedding is a vector that represents that word. The word is said to be embedded in vector space.
Word2vec and GloVe are two techniques used to train word embeddings to predict a word's context. Because an embedding is a form of representation learning, we can "embed" any data type, including sounds, images, and time series.

Epoch: A complete pass through all the training data. A neural network is trained until the error rate is acceptable, and this will often take multiple passes through the complete dataset. Note: an iteration is one parameter update and is typically less than a full pass. For example, if batch size is 100 and data size is 1,000, an epoch will have 10 iterations. If trained for 30 epochs, there will be 300 iterations.

Epoch versus iteration: In machine-learning parlance, an epoch is a complete pass through a given dataset. That is, by the end of one epoch, your neural network, be it a restricted Boltzmann machine, convolutional net, or deep belief network, will have been exposed to every record within the dataset once. Not to be confused with an iteration, which is simply one update of the neural net model's parameters. Many iterations can occur before an epoch is over. Epoch and iteration are only synonymous if you update your parameters once for each pass through the whole dataset; if you update using minibatches, they mean different things. Say your data consist of two minibatches, A and B: performing three iterations on each batch in turn trains like AAABBB, while three epochs look like ABABAB.

Extract, transform, load (ETL): Data are loaded from disk or other sources into memory with the proper transforms, such as binarization and normalization. Broadly, you can think of a data pipeline as the process of gathering data from disparate sources and locations, putting it into a form that your algorithms can learn from, and then placing it in a data structure that they can iterate through.
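The epoch/iteration arithmetic described above (batch size 100, dataset size 1,000, 30 epochs) reduces to two one-line helpers; these are illustrative names, not part of any library.

```python
def iterations_per_epoch(dataset_size, batch_size):
    # One iteration = one parameter update on one minibatch.
    return dataset_size // batch_size

def total_iterations(dataset_size, batch_size, epochs):
    # One epoch = one complete pass through the dataset.
    return iterations_per_epoch(dataset_size, batch_size) * epochs
```

With a dataset of 1,000 records and batches of 100, each epoch performs 10 updates, so 30 epochs perform 300 updates in total, exactly as the Epoch entry states.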
f1 score: The f1 score is a number between 0 and 1 that explains how well the network performed during training. It is analogous to a percentage, with 1 being the best score and 0 the worst. f1 is basically the probability that your net's guesses are correct.

F1 = 2 × ((precision × recall) / (precision + recall))

Accuracy measures how often you get the right answer; the f1 score tells you how meaningful that accuracy is. For example, if you have 100 fruits, 99 apples and 1 orange, and your model predicts that all 100 items are apples, then it is 99% accurate. But that model failed to identify the difference between apples and oranges. f1 scores help you judge whether a model is actually doing well at classifying when you have an imbalance in the categories you are trying to tag. An f1 score is an average of both precision and recall. More specifically, it is a type of average called the harmonic mean, which tends to be less than the arithmetic or geometric mean. Recall answers: "Given a positive example, how likely is the classifier to detect it?" It is the ratio of true positives to the sum of true positives and false negatives. Precision answers: "Given a positive prediction from the classifier, how likely is it to be correct?" It is the ratio of true positives to the sum of true positives and false positives. For f1 to be high, both recall and precision of the model must be high.

Feed-forward network: A neural network that takes the initial input and triggers the activation of each layer of the network successively, without circulating. Feed-forward nets contrast with recurrent and recursive nets in that feed-forward nets never let the output of one node circle back to the same or previous nodes.

Gated recurrent unit (GRU): A pared-down LSTM. GRUs rely on gating mechanisms to learn long-range dependencies while sidestepping the vanishing gradient problem. They include reset and update gates to decide when to update the GRU's memory at each time step.
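The precision, recall, and f1 definitions above translate directly into code. This is a minimal sketch from the stated ratios, using true positives (tp), false positives (fp), and false negatives (fn):

```python
def precision(tp, fp):
    # Of the positive predictions, how many were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the actual positives, how many were detected?
    return tp / (tp + fn)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * (p * r) / (p + r)
```

In the apples-and-oranges example, the "apple" class scores tp = 99, fp = 1, fn = 0, giving precision 0.99 and recall 1.0; the single missed orange is what an accuracy figure of 99% hides and what a per-class f1 exposes.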
See Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation at https://www.aclweb.org/anthology/D14-1179.

Gaussian distribution: A Gaussian, or normal, distribution is a continuous probability distribution that represents the probability that any given observation will occur at different points of a range. Visually, it resembles what is usually called a bell curve.

Global Vectors (GloVe): GloVe is a generalization of Tomas Mikolov's word2vec algorithms, a technique for creating neural word embeddings. It was first presented by Jeffrey Pennington, Richard Socher, and Christopher Manning of Stanford's NLP department. Deeplearning4j provides an implementation of GloVe. See GloVe: Global Vectors for Word Representation at https://nlp.stanford.edu/pubs/glove.pdf.

Gradient clipping: Gradient clipping is one way to solve the problem of exploding gradients. Exploding gradients arise in deep networks when gradients associating weights and the net's error become too large. Exploding gradients are frequently encountered in RNNs dealing with long-term dependencies. One way to clip gradients is to normalize them when the L2 norm of a parameter vector surpasses a given threshold.

Gradient descent: The gradient is a derivative, which you will know from differential calculus. That is, it is the ratio of the rate of change of a neural net's parameters to the error it produces, as the net learns how to reconstruct a dataset or make guesses about labels. The process of minimizing error is called gradient descent. Descending a gradient has two aspects: choosing the direction to step in (momentum) and choosing the size of the step (learning rate). Since MLPs are, by construction,
differentiable operators, they can be trained to minimize any differentiable objective function using gradient descent. The basic idea of gradient descent is to find the derivative of the objective function with respect to each of the network weights and then adjust the weights in the direction of the negative slope.

Graphical models: A directed graphical model is another name for a Bayesian net, which represents the probabilistic relationships between the variables represented by its nodes.

Highway networks: An architecture introduced by Schmidhuber and colleagues to let information flow unhindered across several RNN layers on so-called "information highways." The architecture uses gating units that learn to regulate the flow of information through the net. Highway networks with hundreds of layers can be trained directly using SGD, which means they can support very deep architectures.

Hyperplane: A hyperplane in an n-dimensional Euclidean space is a flat, (n − 1)-dimensional subset of that space that divides the space into two disconnected parts. What does that mean intuitively? First think of the real line. Now pick a point. That point divides the real line into two parts (the part above that point and the part below it). The real line has one dimension, while the point has zero dimensions. So a point is a hyperplane of the real line. Now think of the two-dimensional plane. Pick any line. That line divides the plane into two parts ("left" and "right," or maybe "above" and "below"). The plane has two dimensions, but the line has only one. So a line is a hyperplane of the 2-D plane. Notice that if you pick a point, it does not divide the 2-D plane into two parts; one point is not enough. Now think of a 3-D space. To divide the space into two parts, you need a plane. Your plane has two dimensions, your space has three. So a plane is the hyperplane for a 3-D space. OK, now we have run out of visual examples.
But suppose you have a space of n dimensions. You can write down an equation describing an (n − 1)-dimensional object that divides the n-dimensional space into two pieces. That is a hyperplane.

ImageNet Large Scale Visual Recognition Challenge (ILSVRC): The ImageNet Large Scale Visual Recognition Challenge is the formal name for ImageNet, a yearly contest held to solicit and evaluate the best techniques in image recognition. Deep convolutional architectures have driven error rates on the ImageNet competition from 30% to less than 5%, which means they now have human-level accuracy.

International Conference for Machine Learning: ICML, or the International Conference for Machine Learning, is a well-known and well-attended machine-learning conference.

International Conference on Learning Representations: ICLR, pronounced "I-clear." An important conference. See Representation learning.

Iteration: An iteration is an update of weights after analyzing a batch of input records. See Epoch for clarification.

GoogLeNet: Google's GoogLeNet architecture is a deep convolutional network. It won ILSVRC in 2014 and introduced techniques for paring down the size of a CNN, thus increasing computational efficiency. See Going Deeper with Convolutions at https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf.

Log likelihood: Related to the statistical idea of the likelihood function. Likelihood is a function of the parameters of a statistical model. The probability of some observed outcomes given a set of parameter values is referred to as the likelihood of the set of parameter values given the observed outcomes.
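The hyperplane intuition above can be written down in one formula: the set {x : w·x = b}, for a normal vector w and offset b, splits n-dimensional space in two, and the sign of w·x − b tells you which side a point is on. The helper below is an illustrative sketch, not library code.

```python
def side_of_hyperplane(w, x, b=0.0):
    """Return +1, -1, or 0 for which side of the hyperplane
    {x : w . x = b} the point x falls on."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - b
    return (s > 0) - (s < 0)
```

In one dimension, `w = [1], b = 2` is the point 2 splitting the real line; in two dimensions, `w = [0, 1], b = 0` is the x-axis splitting the plane into "above" and "below," exactly the examples in the entry.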
Long short-term memory units (LSTMs): A form of recurrent neural network invented in the 1990s by Sepp Hochreiter and Juergen Schmidhuber, and now widely used for image, sound, and time series analysis, because they help solve the vanishing gradient problem by using memory gates. Alex Graves made significant improvements to the LSTM with what is now known as the Graves LSTM, which Deeplearning4j implements. See LSTMs with Deeplearning4j; Using RNNs with Deeplearning4j; and the original paper, Long Short-Term Memory, via https://skymind.ai/wiki/lstm.

Maximum likelihood estimation: Say you have a coin and you are not sure it is "fair." So you want to estimate the "true" probability it will come up heads. Call this probability P, and code the outcome of a coin flip as 1 if it is heads and 0 if it is tails. You flip the coin four times and get 1, 0, 0, 0 (i.e., 1 head and 3 tails). What is the likelihood that you would get these outcomes, given P? Well, the probability of heads is P, as we defined it earlier. That means the probability of tails is (1 − P). So, the probability of 1 head and 3 tails is P × (1 − P)³ (we call this the "likelihood" of the data). If we "guess" that the coin is fair, that is saying P = 0.5, so the likelihood of the data is L = 0.5 × (1 − 0.5)³ = 0.0625. What if we guess that P = 0.45? Then L = 0.45 × (1 − 0.45)³ ≈ 0.075. So P = 0.45 is actually a better estimate than P = 0.5, because the data are "more likely" to have occurred if P = 0.45 than if P = 0.5. At P = 0.4, the likelihood is 0.4 × (1 − 0.4)³ = 0.0864. At P = 0.35, the likelihood is 0.35 × (1 − 0.35)³ ≈ 0.096. In this case, it turns out that the value of P that maximizes the likelihood is P = 0.25, which matches the observed proportion of heads. So that is our maximum likelihood estimate for P. In practice, maximum likelihood is harder to estimate than this (with predictors and various assumptions about the distribution of the data and error terms), but that is the basic concept behind it.
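The coin-flip arithmetic above can be reproduced, and the maximizing value of P found by a simple grid search. This is an illustrative sketch of the same example (the constant binomial coefficient is ignored, as in the text).

```python
def likelihood(p, heads, tails):
    # Likelihood of observing `heads` heads and `tails` tails,
    # given head probability p (binomial coefficient omitted).
    return (p ** heads) * ((1 - p) ** tails)

# Grid-search the maximum likelihood estimate for 1 head, 3 tails.
candidates = [i / 100 for i in range(1, 100)]
mle = max(candidates, key=lambda p: likelihood(p, 1, 3))
```

The search lands on P = 0.25, the observed proportion of heads, matching the entry's conclusion.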
So, in a sense, probability is treated as an unseen, internal property of the data: a parameter. And likelihood is a measure of how well the outcomes recorded in the data match our hypothesis about their probability; i.e., our theory about how the data are produced. The better our theory of the data's probability, the higher the likelihood of a given set of outcomes.

MNIST: MNIST is the "hello world" of deep learning datasets. Everyone uses MNIST to test their neural networks, just to see if the net actually works at all. MNIST contains 60,000 training examples and 10,000 test examples of the handwritten numerals 0–9. These images are 28 × 28 pixels, which means they require 784 nodes on the first input layer of a neural network. MNIST is available for download.

Model: In neural networks, the model is the collection of weights and biases that transform input into output. A neural network is a set of algorithms that update models such that the models guess with less error as they learn. A model is a symbolic, logical, or mathematical machine whose purpose is to deduce output from input. If a model's assumptions are correct, then one must necessarily believe its conclusions. Neural networks produce trained models that can be deployed to process, classify, cluster, and make predictions about data.

Model parallelism: Another way to accelerate neural net training on very large datasets. Here, instead of sending batches of faces to separate neural networks, let us imagine a different kind of image: an enormous map of the earth. Model parallelism would divide that enormous map into regions, and it would park a separate CNN on each region, to train on only that area and no other. Then, as each enormous map was peeled off the dataset to train the neural networks, it would be broken up, and different patches of it would be sent to train on separate CNNs. No parameter averaging is necessary here.
Model score: As your model trains, the goal is to improve its "score"; that is, to reduce the overall error rate. A graph of the score at each iteration can then be presented. For text-based console output of the score as the model trains, you would use a ScoreIterationListener.
Multilayer perceptron (MLP): MLPs are perhaps the oldest form of deep neural network. They consist of multiple, fully connected feed-forward layers.

Nesterov's momentum: Momentum, also known as Nesterov's momentum, influences the speed of learning. It causes the model to converge faster to a point of minimal error. Momentum adjusts the size of the next step, the weight update, based on the previous step's gradient; that is, it takes the gradient's history into account. Before each new step, a provisional gradient is calculated by taking partial derivatives from the model, and the hyperparameters are applied to it to produce a new gradient. Momentum influences the gradient your model uses for the next step.

Neural machine translation: Maps one language to another using neural networks. Typically, recurrent neural networks are used to ingest a sequence from the input language and output a sequence in the target language. See Sequence to Sequence Learning with Neural Networks at https://arxiv.org/abs/1409.3215.

Noise-contrastive estimation (NCE): Offers a balance of computational and statistical efficiency. It is used to train classifiers with many classes in the output layer. It replaces the softmax probability density function with an approximation of a maximum likelihood estimate that is cheaper to compute.

Nonlinear transform function: A function that maps input on a nonlinear scale, such as sigmoid or tanh. By definition, a nonlinear function's output is not directly proportional to its input.

Normalization: The process of transforming the data to span a range from 0 to 1.

Object-oriented programming (OOP): While deep learning and object-oriented programming do not necessarily go together, Deeplearning4j is written in Java following the principles of OOP.
In OOP, you create so-called objects, which are generally abstract nouns representing a part in a larger symbolic machine (e.g., in Deeplearning4j, the class DataSetIterator traverses across datasets and feeds parts of those datasets into another process, iteratively, piece by piece). DataSetIterator is the name of a class of object. In any particular object-oriented program, you would create a particular instance of that general class, calling it, say, iter. Every object is really just a data structure that combines fields containing data and methods that act on the data in those fields. The way you talk about those fields and methods is with the dot operator and parentheses () that contain parameters. For example, if you wrote iter.next(5), then you would be telling the DataSetIterator to go across a dataset processing five instances of that data (say five images or records) at a time, where next is the method you call, and 5 is the parameter you pass into it. You can learn more about DataSetIterator and other Deeplearning4j classes in the project's Javadoc; for Java documentation generally, see https://www.oracle.com/technetwork/java/javase/documentation/index-137868.html.

Objective function: Also called a loss function or a cost function, an objective function defines what success looks like when an algorithm learns. It is a measure of the difference between a neural net's guess and the ground truth; that is, the error. Measuring that error is a precondition to updating the neural net in such a way that its guesses generate less error. The error resulting from the loss function is fed into backpropagation to update the weights and biases that process input in the neural network.

One-hot encoding: Used in classification and bag of words (BOW). The label for each example is all 0s, except for a 1 at the index of the actual class to which the example belongs. For BOW, the 1 represents a word encountered.
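The one-hot scheme just described, all 0s with a single 1 at the class index, is a two-liner. The helper name below is illustrative.

```python
def one_hot(index, num_classes):
    """All 0s except a 1 at the index of the actual class."""
    v = [0] * num_classes
    v[index] = 1
    return v
```

For three classes, say apple, orange, and pear, the label "orange" (index 1) encodes as [0, 1, 0].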
Pooling: Pooling, max pooling, and average pooling are terms that refer to downsampling or subsampling within a convolutional network. Downsampling is a way of reducing the amount of data
flowing through the network and therefore decreasing the computational cost of the network. Average pooling takes the average of several values. Max pooling takes the greatest of several values. Max pooling is currently the preferred type of downsampling layer in convolutional networks.

Probability density: Used in unsupervised learning, with algorithms such as autoencoders, VAEs, and GANs. A probability density essentially asks, "for a given variable (e.g., radius), what, at that particular value, is the likelihood of encountering an event or an object (e.g., an electron)?" So if I am at the nucleus of an atom and I move to, say, 1 Angstrom away, at 1 Angstrom there is a certain likelihood I will spot an electron. But we often do not just ask for the probability at one point; we would sometimes like to find the probability for a range of points: what is the probability of finding an electron between the nucleus and 1 Angstrom, for example? So we add up ("integrate") the probability from zero to 1 Angstrom. For the sake of convenience, we sometimes employ "normalization"; that is, we require that adding up all the probabilities over every possible value will give us exactly 1.

Probability distribution: A probability distribution is a mathematical function and/or graph that tells us how likely something is to happen. So, for example, if you are rolling two dice and you want to find the likelihood of each possible total, you could chart the outcomes. Such a chart shows that you are most likely to get a 7, then a 6, then an 8, and so on. Each probability can be written either as a percentage or as a fraction; they mean the same thing, just in different forms. The way that you use the distribution to find the likelihood of each outcome is this: there are 36 possible ways for the two dice to land.
There are six combinations that give you 7, five each that give you 6 or 8, four each that give you 5 or 9, and so on. So, the likelihood of each total is the number of combinations that produce it divided by the total number of possible combinations. For 7, it would be 6/36, or 1/6. For 8, it is 5/36, and so on. The key thing to note here is that the sum of all of the probabilities will equal 1 (or 100%). That is really important, because it is absolutely essential that rolling the two dice produce some result every time. If all the percentages added up to 90%, what would be happening that last 10% of the time? For more complex probability distributions, the way the distribution is generated is more involved, but the way you read it is the same. If, for example, a distribution reaches 0.4 on the vertical axis at its mean μ, you know that you are going to get a value at or near μ about 40% of the time whenever you run the experiment or test associated with that distribution. The percentages in the shaded areas of such a graph are also important. Just as the sum of all the probabilities has to equal 1 or 100%, the area under the curve of a probability distribution has to equal 1, too. You do not need to know why that is (it involves calculus), but it is worth mentioning. Such graphs are often helpfully labeled to show you what percentage of the time you will end up somewhere in a given area. For a normal distribution, for example, about 68% of the time you will end up between −1σ and +1σ.

Reconstruction entropy: After applying Gaussian noise, a kind of statistical white noise, to the data, this objective function punishes the network for any result that is not closer to the original input. That signal prompts the network to learn different features in an attempt to reconstruct the input better and minimize error.
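The two-dice distribution described above can be generated by enumerating all 36 equally likely outcomes; exact fractions make it easy to confirm that the probabilities sum to 1, as the text requires. This is an illustrative sketch.

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two dice
# and count how many combinations produce each total.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))

# Probability of each total = combinations / 36, as an exact fraction.
dist = {total: Fraction(n, 36) for total, n in counts.items()}
```

Reading the result off: 7 has probability 6/36 = 1/6, 6 and 8 each have 5/36, and the eleven probabilities add up to exactly 1.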
216
Chapter 7 Synthesizing DNA-encoded data
Rectified linear units: Rectified linear units, or ReLUs, are a nonlinear activation function widely applied in neural networks because they deal well with the vanishing gradient problem. They can be expressed as f(x) = max(0, x): activation is 0 if the input does not exceed zero, and activation increases linearly above that threshold. Recurrent neural networks (RNNs): While a multilayer perceptron (MLP) can only map from input to output vectors, an RNN can in principle map from the entire history of previous inputs to each output. Indeed, the result equivalent to the universal approximation theorem for MLPs is that an RNN with a sufficient number of hidden units can approximate any measurable sequence-to-sequence mapping to arbitrary accuracy (Hammer, 2000; https://www.springer.com/gp/book/9781852333430). "The key point is that the recurrent connections allow a 'memory' of previous inputs to persist in the network's internal state, which can then be used to influence the network output. The forward pass of an RNN is the same as that of an MLP with a single hidden layer, except that activations arrive at the hidden layer from both the current external input and the hidden layer activations one step back in time." (Alex Graves; see https://arxiv.org/abs/1308.0850.) Recursive neural networks: Recursive neural networks learn data with structural hierarchies, such as text arranged grammatically, much like recurrent neural networks learn data structured by its occurrence in time. Their chief use is in natural-language processing, and they are associated with Richard Socher of Stanford's NLP lab. Reinforcement learning: A branch of machine learning that is goal oriented; that is, reinforcement learning algorithms have as their objective to maximize a reward, often over the course of many decisions. Unlike deep neural networks, reinforcement learning is not differentiable. Representation learning: Learning the best representation of input.
A vector, for example, can "represent" an image. Training a neural network will adjust the vector's elements to represent the image better or lead to better guesses when a neural network is fed the image. The neural net might train to guess the image's name, for instance. Deep learning means that several layers of representations are stacked atop one another, and those representations are increasingly abstract; i.e., the initial, low-level representations are granular and may represent pixels, whereas the higher representations will stand for combinations of pixels, and then combinations of combinations, and so forth. Residual networks (ResNet): Microsoft Research used deep residual networks to win ImageNet in 2015. ResNets create "shortcuts" across several layers (deep ResNets have 150 layers), allowing the net to learn so-called residual mappings. ResNets are similar to nets with highway layers, although they are data independent. Microsoft Research created ResNets by generating different deep networks automatically and relying on hyperparameter optimization. See Deep Residual Learning for Image Recognition at https://www.researchgate.net/publication/286512696_Deep_Residual_Learning_for_Image_Recognition. Restricted Boltzmann machine (RBM): Boltzmann machines that are constrained to feed input forward symmetrically, which means all the nodes of one layer must connect to all the nodes of the subsequent layer. Stacked RBMs are known as a deep belief network and are used to learn how to reconstruct data layer by layer. Introduced by Geoff Hinton, RBMs were partially responsible for the renewed interest in deep learning that began around 2006. In many labs, they have been replaced with more stable layers such as variational autoencoders. See A Practical Guide to Training Restricted Boltzmann Machines at https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf. RMSProp: An optimization algorithm like Adagrad.
In contrast to Adagrad, it relies on a decay term to prevent the learning rate from decreasing too rapidly.
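A common formulation of the RMSProp idea described above can be sketched in a few lines (this is an illustrative sketch, not DL4J's actual updater; the decay rate of 0.9 and the epsilon are conventional defaults, not values from the text). A decaying average of squared gradients scales each step, so the effective learning rate levels off once gradients stabilize instead of shrinking forever:

```python
def rmsprop_step(param, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update on a scalar parameter: decay the running average
    of squared gradients, then divide the step by its square root."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad * grad
    param = param - lr * grad / (avg_sq ** 0.5 + eps)
    return param, avg_sq

# With a constant gradient, the running average settles near grad**2,
# so the step size levels off rather than decaying toward zero.
p, s = 1.0, 0.0
for _ in range(100):
    p, s = rmsprop_step(p, 2.0, s)
```

This is the contrast with Adagrad drawn above: Adagrad's accumulator only grows, so its effective learning rate decreases monotonically; the decay term here prevents that.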
Appendices
Score: Measurement of the overall error rate of the model. The score of the model can be displayed graphically in the web UI or in the console by using ScoreIterationListener. Serialization: Serialization is how you translate data structures or object state into storable formats. Deeplearning4j's nets are serialized, which means they can operate on devices with limited memory. Skipgrams: The prerequisite to a definition of skipgrams is one of n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. A unigram represents one "item," a bigram two, a trigram three, and so forth. Skipgrams are n-grams in which the items are not necessarily contiguous. Skipping is a form of noise, in the sense of noising and denoising, which allows neural nets to better generalize their extraction of features. See how skipgrams are implemented in Word2vec. Softmax: A function used as the output layer of a neural network that classifies input. It converts vectors into class probabilities. Softmax normalizes the vector of scores by first exponentiating and then dividing by a constant. See A Scalable Hierarchical Distributed Language Model at https://www.cs.toronto.edu/~amnih/papers/hlbl_final.pdf. Stochastic gradient descent (SGD): Optimizes gradient descent and minimizes the loss function during network training. Stochastic is simply a synonym for "random." A stochastic process is a process that involves a random variable, such as randomly initialized weights. Stochastic derives from the Greek word stochazesthai, "to guess or aim at." Stochastic processes describe the evolution of, say, a random set of variables, and as such they involve some indeterminacy, quite the opposite of a precisely predicted process that is deterministic and has just one outcome. The stochastic element of a learning process is a form of search.
Random weights represent a hypothesis, an attempt, or a guess that one tests. The results of that search are recorded in the form of a weight adjustment, which effectively shrinks the search space as the parameters move toward a position of less error. Neural network gradients are calculated using backpropagation. SGD is usually used with minibatches, such that parameters are updated based on the average error generated by the instances of a whole batch. See SGD Updater in Deeplearning4j at https://deeplearning4j.org/docs/latest/nd4j-nnupdaters. Support vector machine (SVM): While SVMs are not neural networks, they are an important algorithm that deserves explanation. An SVM, or its regression variant SVR, is just trying to draw a line through your training points, like regular old linear regression, except for the following three details: (1) There is an epsilon parameter that means "if the line fits a point to within epsilon, that is good enough; stop trying to fit it and worry about fitting other points." (2) There is a C parameter, and the smaller you make it, the more you are telling it to find "nonwiggly" lines. So if you run SVR and get some crazy wiggly output that is obviously not right, you can often make C smaller and it will stop being crazy. Finally, (3) when there are outliers in your data (e.g., bad points that will never fit your line), they will only mess up your result a little bit. This is because SVR only gets upset about outliers in proportion to how far away they are from the line it wants to fit, whereas normal linear regression gets upset in proportion to the square of the distance from the line and so worries too much about these bad points. TL;DR: SVR is trying to draw a line that gets
within epsilon of all the points. Some points are bad and cannot be made to get within epsilon, and SVR does not get too upset about them, whereas other regression methods flip out. For further information, see https://www.scribd.com/document/380170103/Deeplearning4j-Org-Glossary. Tensors: Multidimensional arrays of numeric data; a tensor along dimension (TAD) is a lower-dimensional view of a tensor taken along one of its dimensions. Vanishing gradient problem: A challenge that confronts backpropagation over many layers. Backpropagation establishes the relationship between a given weight and the error of a neural network. It does so through the chain rule of calculus, calculating how the change in a given weight along a gradient affects the change in error. However, in very deep neural networks, the gradient that relates the weight change to the error change can become very small; so small that updates in the net's parameters hardly change the net's guesses and error; so small, in fact, that it is difficult to know in which direction the weight should be adjusted to diminish error. Nonlinear activation functions such as sigmoid and tanh make the vanishing gradient problem particularly difficult, because the activation function tapers off at both ends. This has led to the widespread adoption of rectified linear units (ReLUs) for activations in deep nets. It was in seeking to solve the vanishing gradient problem that Sepp Hochreiter and Juergen Schmidhuber invented a form of recurrent network called an LSTM in the 1990s. The inverse of the vanishing gradient problem, in which the gradient is impossibly small, is the exploding gradient problem, in which the gradient is impossibly large (i.e., changing a weight has too much impact on the error). See On the difficulty of training recurrent neural networks at https://www.bioinf.jku.at/publications/older/2604.pdf. Transfer learning: When a system can recognize and apply knowledge and skills learned in previous domains or tasks to novel domains or tasks.
That is, if a model is trained on image data to recognize one set of categories, transfer learning applies if that same model is capable, with minimal additional training, of recognizing a different set of categories. For example, trained on 1000 celebrity faces, a transfer learning model can be taught to recognize members of your family by swapping in another output layer with the nodes "mom," "dad," "elder brother," and "younger sister" and training that output layer on the new classifications. Vector: Word2vec and other neural networks represent input as vectors. A vector is a data structure with at least two components, as opposed to a scalar, which has just one. For example, a vector can represent velocity, an idea that combines speed and direction: wind velocity = 50 mph, 35 degrees northeast. A scalar, on the other hand, can represent something with one value, like temperature or height: 50°C, 180 cm. Therefore, we can represent two-dimensional vectors as arrows on an x-y graph, with coordinates x and y each representing one of the vector's values. Two vectors can relate to one another mathematically, and similarities between them (and therefore between anything you can vectorize, including words) can be measured with precision. Such vectors differ from one another in both their length, or magnitude, and their angle, or direction; the angle is what concerns us here. VGG: A deep convolutional architecture that won the benchmark ImageNet competition in 2014. A VGG architecture is composed of 16-19 weight layers and uses small convolutional filters. Word2vec: Tomas Mikolov's neural networks, known as Word2vec, have become widely used because they help produce state-of-the-art word embeddings. Word2vec is a two-layer neural net that processes text. Its input is a text corpus, and its output is a set of vectors: feature vectors for words in that corpus.
While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. Word2vec’s applications extend beyond parsing sentences in the wild. It can be
applied just as well to genes, code, playlists, social media graphs, and other verbal or symbolic series in which patterns may be discerned. Deeplearning4j implements a distributed form of Word2vec for Java and Scala, which works on Spark with GPUs. Xavier initialization: Based on the work of Xavier Glorot and Yoshua Bengio in their paper "Understanding the Difficulty of Training Deep Feedforward Neural Networks." Weights should be initialized in a way that promotes "learning": the wrong weight initialization will make gradients too large or too small and make it difficult to update the weights. Small weights lead to small activations, and large weights lead to large ones. Xavier weight initialization considers the distribution of output activations with regard to input activations. Its purpose is to maintain the same distribution of activations, so they are neither too small (mean 0 with too little variance) nor too large (mean 0 with too much variance). DL4J's implementation of Xavier weight initialization aligns with the Glorot and Bengio paper: Nd4j.randn(order, shape).muli(FastMath.sqrt(2.0 / (fanIn + fanOut))), where fanIn(k) is the number of units sending input to k, and fanOut(k) is the number of units receiving output from k.
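The Nd4j one-liner above amounts to sampling weights from a normal distribution with variance 2/(fanIn + fanOut). A plain-Python sketch of the same rule (the function name is mine, not DL4J's):

```python
import math
import random

def xavier_init(fan_in, fan_out, seed=0):
    """Return a fan_in x fan_out weight matrix drawn from
    N(0, 2/(fan_in + fan_out)), mirroring
    randn(...).muli(sqrt(2.0 / (fanIn + fanOut)))."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, scale) for _ in range(fan_out)]
            for _ in range(fan_in)]

w = xavier_init(256, 128)  # sampled variance will be close to 2/384
```

The point of the scaling factor is exactly the goal stated above: activation variance stays roughly constant from layer to layer, so gradients are neither crushed nor blown up at initialization.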
Appendix 7.H Glossary (DNA terms) Allele: Alternative form of a gene or DNA sequence. Variations in clinical traits and phenotypes are allelic if they arise from the same gene sequence or locus and nonallelic if they arise from different gene sequences or different loci. Alternative splicing: Use of different exons in formation of mRNA from initially identical transcripts. This results in the generation of related proteins from one gene, often in a tissue- or developmental stage-specific manner. Analytical validity: The likelihood that a test result is correct, i.e., that a specific variant said to be present is present or said to be absent is absent. Array: See Microarray. Attributable risk: The difference in incidence of disease in an exposed versus unexposed population; in genetics, the exposure can be the presence of a specific genetic variation in the genome. Bacteriophage: Viruses whose hosts are bacterial cells. Base pair (bp): Two complementary nucleotides that are paired in double-stranded DNA. Adenine (A) pairs with thymine (T), and guanine (G) pairs with cytosine (C). A bp is also used as a measure of the length of a sequence of nucleotides, e.g., 20 bp is a chain of DNA composed of 20 nucleotides. Call rate: The rate at which assignment of a specific nucleotide base (A, G, C, T) is made at specific positions in the genome during genotyping or sequencing. Candidate gene: A gene believed to influence expression of complex phenotypes due to known biological and/or physiological properties of its products, or to its location near a region of association or linkage. cDNA (complementary DNA): A DNA copy of the messenger RNA (mRNA) transcribed from a gene. The cDNA is made from the mRNA using the enzyme reverse transcriptase. Clinical utility: The degree to which a test result guides clinical management and improves patient outcomes. Clinical validity: The likelihood that a test result correctly predicts the presence or absence of disease or risk of disease.
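The pairing rule in the Base pair entry (A with T, G with C) is mechanical enough to code directly. A small sketch (the helper name is my own, not from the text):

```python
# Watson-Crick pairing over the DNA alphabet.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the opposite strand of a DNA duplex: complement each base
    (A<->T, G<->C) and read it in the reverse direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, which is exactly the complementarity that hybridization (defined later in this glossary) depends on.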
Codon: Three bases in a DNA or RNA sequence that specify a single amino acid. Comparative genomic hybridization (CGH): Also called array CGH, a technology wherein a DNA test sample is competitively hybridized with a reference sample of DNA of known sequence to a DNA microarray; used to detect copy number changes in the test sample. Complex trait: A trait that has a genetic component that does not follow strict Mendelian inheritance. It may involve the interaction of two or more genes or gene-environment interactions. Copy number variants: Stretches of genomic sequence of roughly 1000 base pairs (1 kb) to 3 million base pairs (3 Mb) in size that are deleted or duplicated in varying numbers. Coverage: The number of times a portion of the genome is sequenced in a sequencing reaction. Often expressed as "depth of coverage" and numerically as 1X, 2X, 3X, etc. Cytogenomic analysis: Technologies that assess the presence of copy number variants at locations throughout the genome, one example of which is comparative genomic hybridization. Denaturing high-performance liquid chromatography (DHPLC): A high-performance liquid chromatography technique that uses temperature-dependent separation of DNA containing mismatched base pairs from PCR-amplified DNA fragments for chromatographic mutation analysis. DNA barcoding: A method that uses a short genetic marker in a DNA sequence to identify it as belonging to a particular species or group of otherwise-related sequences. Environmental gene tag: Short sequences of DNA that contain bacterial genes, in whole or in part, that can be used to aid in identification of related genetic material. Exome: The entire portion of the genome consisting of protein-coding sequences (as opposed to introns or noncoding DNA between genes). Exon: Any segment of a gene that is represented in the mature messenger RNA (mRNA) product.
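Reading codons three bases at a time, as the Codon entry describes, can be sketched with a deliberately tiny codon table (only the handful of codons needed for the example, not the full genetic code); translation halts at a stop codon:

```python
# Toy codon table over the DNA alphabet; "*" marks a stop codon.
CODON_TABLE = {"ATG": "M", "AAA": "K", "GGC": "G",
               "TAA": "*", "TAG": "*", "TGA": "*"}

def translate(seq):
    """Translate a coding sequence codon by codon, stopping at a stop codon.
    Codons missing from the toy table are rendered as 'X'."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

print(translate("ATGAAAGGCTAAAAA"))  # MKG: translation halts at the TAA stop codon
```

Shifting the same sequence by one base would regroup every triplet, which is why the insertion or deletion of a single base is so disruptive.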
Frame shift mutation: Any mutation that disrupts the normal sequence of triplets, causing a new sequence to be created that codes for different amino acids. Frame shift mutations are usually caused by an insertion or deletion of DNA and typically produce a premature stop codon downstream. GC content: The percentage of nucleotides in a DNA sequence that are either guanine (G) or cytosine (C). Genetic heterogeneity: A common phenotype caused by more than one gene. Genetic Information Nondiscrimination Act (GINA): US federal law passed in 2008 prohibiting the use of genetic information for decisions regarding employment or health insurance. Genome: The sum total of the genetic material of a cell or an organism. Genome annotation: Attachment of biological information to DNA sequence data. Genome-wide analysis: A genetic study evaluating the potential linkage of genetic markers located throughout the genome to a specific trait. This approach has been used for Mendelian disorders as well as complex traits (genome-wide association study [GWAS]). Genomic inflation factor: A mathematical term from genetic epidemiology used to control for population stratification in GWAS. Genomic medicine: A term used to describe medical advances and approaches based on human genomic information, sometimes referred to as personalized or precision medicine. Genomics: The study of genes and their function. Genotype: The specific set of two alleles inherited at a genetic locus. Haplotype: The combination of linked marker alleles (may be polymorphisms or mutations) for a given region of DNA on a single chromosome.
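The GC content entry above reduces to a one-line percentage; a minimal sketch:

```python
def gc_content(seq):
    """Percentage of bases in a DNA sequence that are guanine or cytosine."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)

print(round(gc_content("ATGCGC"), 1))  # 66.7: four of the six bases are G or C
```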
HapMap: The International HapMap Project developed a haplotype map of the human genome, the HapMap, which describes the common patterns of human DNA sequence variation. The HapMap is a key resource for finding genes affecting health, disease, and responses to drugs and environmental factors. The first release of the HapMap was made in 2005. Heterologous expression: A research technique that causes a protein to be produced in a cell that does not normally make (i.e., express) that protein. Heterotetrameric, homotetrameric, and heteromultimeric ion channels: Ion channels made up of different combinations of protein subunits; four different subunits (heterotetrameric), four of the same subunit (homotetrameric), and two or more different subunits (heteromultimeric). Heterozygous (heterozygosity): Having two unlike alleles at a particular locus. Homozygous (homozygosity): Having two like or identical alleles at a particular locus in a diploid genome. Human Genome Project: Collective name for several projects begun in 1986 by the US Department of Energy (DOE) to create an ordered set of DNA segments from known chromosomal locations, develop new computational methods for analyzing genetic map and DNA sequence data, and develop new techniques and instruments for detecting and analyzing DNA. The joint national effort, led by DOE and the National Institutes of Health, was known as the Human Genome Project. The first draft of the human genome DNA sequence, produced by the efforts of the Human Genome Project, was completed in 2001. The Human Genome Project officially ended in April 2003. Hybridization: The bonding of single-stranded DNA or RNA into double-stranded DNA or RNA. The ability of complementary stretches of DNA or RNA to hybridize with each other is dependent on the base pair sequence. Identity by descent (IBD): The property of two or more alleles that are identical to an ancestral allele, used in gene association studies.
Imputation: A statistical method for inferring genotypes that are not directly measured. Intron: A segment of DNA that is transcribed into RNA but is ultimately removed from the transcript by splicing together the sequences on either side (exons) to produce messenger RNA (mRNA). Kilobase (kb): 1000 base pairs of DNA or RNA. Library: A complete set of clones that contains all the genetic material from an organism, tissue, or specific cell type at a specific stage of development. Linkage: Two loci (genes or other designated DNA sequence) that reside close enough to each other that recombination (crossing over) rarely occurs between them. Alleles at the two loci do not assort independently at meiosis but are likely to be inherited together. Linkage disequilibrium (LD): Refers to alleles at loci close enough together that they remain inherited together through many generations because their extreme close proximity makes recombination (crossing over) between them highly unlikely. Locus (plural: loci): The physical site on a chromosome occupied by a particular gene or other identifiable DNA sequence characteristic. Megabase: One million base pairs. Mendelian disorder (single-gene disorder): A trait or disease that follows the patterns of inheritance that suggest the trait or disease is determined by a gene at a single locus. Metagenomics: Study of a collection of genetic material (genomes) from a mixed community of organisms. Metagenomics usually refers to the study of microbial communities.
Methylation: Covalent attachment of methyl groups to DNA, usually at cytosine bases. Methylation can reduce transcription from a gene and is a mechanism in X-chromosome inactivation and imprinting. Microarray: A technology used to study many genes simultaneously, usually consisting of an ordered microscopic pattern of known nucleic acid sequences on a glass slide. In a common type of microarray, a sample of DNA or RNA is added to the slide, and sequence-dependent binding is measured using sensitive fluorescent detection methods. Minor allele: The allele of a biallelic polymorphism that is less frequent in the study population. Minor allele frequency refers to the proportion of the less common of two alleles in a population (with two alleles carried by each person at each autosomal locus), ranging from less than 1% up to 50%. Missense mutation: A mutation, typically the change of a single nucleotide, that results in the substitution of one amino acid for another in the final gene product. Mutation: Any alteration of a gene or genetic material from its natural state. Generally, mutations refer to changes that alter the gene in a negative sense, causing the protein product of the gene to have an altered function. Next generation/high-throughput sequencing: DNA sequencing technology that permits rapid sequencing of large portions of the genome; so called because it vastly increases the throughput over classic Sanger sequencing. Nonsense mutation: Any mutation that results directly in the formation of a stop codon. Nonsynonymous variant: A polymorphism that results in a change in the amino acid sequence of a protein (and therefore may affect the function of the protein). Nucleotide: The combination of a nitrogen-containing base, a 5-carbon sugar, and a phosphate group, forming the A, G, C, and T of a DNA sequence, for example. Oncogene: A gene, one or more forms of which is associated with cancer.
Many oncogenes are involved, directly or indirectly, in controlling the rate of cell growth. Patch clamp technique: A laboratory technique in electrophysiology that allows the study of single or multiple ion channels in cells. Penetrance: The proportion of individuals of a given genotype who show any evidence of the associated phenotype. Pharmacogenetic polymorphism: Genetic variants that alter the way an individual metabolizes or responds to a specific medication. Pharmacogenomics: Study of genes related to genetically controlled variation in drug responses. Phenotype: The total observable nature of an individual, resulting from interaction of the genotype with the environment. Plasmid: Circular extrachromosomal DNA molecules in bacteria that can reproduce independently. Plasmids can be used as vectors in recombinant DNA research, and in nature they can contain genes important to bacterial virulence, such as antibiotic resistance. Polymerase chain reaction (PCR): A procedure in which segments of DNA (including DNA copies of RNA) can be amplified using flanking oligonucleotides called primers and repeated cycles of replication by DNA polymerase.
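The "repeated cycles of replication" in the PCR entry imply exponential growth: at 100% efficiency, every cycle doubles the number of target copies. A sketch of the ideal-case arithmetic (real reactions run below perfect efficiency, hence the hedged parameter; the function is illustrative, not a laboratory model):

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Ideal PCR yield: copies multiply by (1 + efficiency) each cycle,
    so efficiency 1.0 doubles the template every cycle."""
    return initial_copies * (1.0 + efficiency) ** cycles

print(int(pcr_copies(1, 30)))  # 1073741824: 2**30 copies from a single molecule
```

This exponential amplification is what makes PCR sensitive enough to detect a handful of template molecules, and it is also why quantitative PCR (defined later in this glossary) can infer starting amounts from the cycle at which the signal crosses a threshold.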
Polymorphism: Difference in DNA sequence among individuals that may underlie differences in health. Genetic variations occurring in more than 1% of a population would be considered useful polymorphisms for genetic linkage analysis. The vast majority of DNA polymorphisms are benign and not associated with a detectable phenotype. Population stratification: Also called population structure; a form of confounding in genetic association studies caused by genetic differences between cases and controls unrelated to disease but due to sampling them from populations of different ancestries. Proband: The affected person whose disorder, or concern about a disorder, brings a family or pedigree to be genetically evaluated. Promoter: The sequence of nucleotides located 5′ to the coding sequence of a gene that determines the site for binding of RNA polymerase and the initiation of transcription. More than one promoter may be present in a gene and may give rise to different versions of the protein. Prophage: The genome of a bacteriophage when it is integrated into the host bacterial genome or a plasmid. Pyrosequencing: A method of determining the ordering of nucleotide bases in a DNA molecule by measuring the synthesis of the complementary DNA strand. Quantitative PCR: A PCR-based laboratory technique that allows the accurate measurement of the amount of specific nucleic acids (usually RNA) in a sample. Read: A discrete segment of sequence information generated by a sequencing instrument; read length refers to the number of nucleotides in the segment. Recombination: The formation of a new set of alleles on a single chromosome that is not the same as either parent owing to a crossover during meiosis. Restriction fragment-length polymorphism (RFLP): A type of polymorphism that results from variation in the DNA sequence recognized by restriction enzymes. RFLPs can be used in linkage and gene association studies of traits and diseases.
Single-nucleotide polymorphism (SNP): DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. SNPs are the most abundant variant in the human genome and are the most common source of genetic variation, with more than 10 million SNPs present in the human genome, representing a density of 1 SNP for approximately every 100 bases. Stop codon (termination codon): The DNA triplet that causes translation to end when it is found in messenger RNA (mRNA). The DNA stop codons are TAG, TAA, and TGA. Tag SNP: A readily measured SNP that is in strong linkage disequilibrium with multiple other SNPs so that it can serve as a proxy for these SNPs on large-scale genotyping platforms. Translocation: A chromosomal segment that has been broken off and reinserted in a different place in the genome. Transversion: The substitution of a purine for a pyrimidine nucleotide or vice versa (e.g., an A for a C or T) in a DNA sequence. Uniparental disomy: The inheritance of both parental copies of a chromosome from one parent and no homologous chromosome from the other parent. The resulting offspring could be affected with a recessive disease if the parent contributing both copies is a carrier. Variant of unknown significance (VUS): Genetic variant that cannot be definitively determined to be associated with a specific phenotype.
Suggested readings 3D printing https://3dprint.com/217369/new-research-dna-3d-printer/. Animated study of DNA translation http://depts.washington.edu/hhmibio/translationStudyGuide.pdf. Bio Hackers https://www.wired.com/story/malware-dna-hack/. CRISPR Immune System www.sciencedirect.com/science/article/pii/S0300908415001042. Deep Learning for Neural Network https://ac.els-cdn.com/S0893608014002135/1-s2.0-S0893608014002135-main.pdf. DeepLocker https://i.blackhat.com/us-18/Thu-August-9/us-18-Kirat-DeepLocker-Concealing-Targeted-Attacks-with-AI-Locksmithing.pdf. DeepLocker https://securityintelligence.com/deeplocker-how-ai-can-power-a-stealthy-new-breed-of-malware/. Desktop guide http://www.ecole-adn.fr/uploads/2017/03/CRISPR_101_eBook_-REF-CS.pdf. DNA-based computer for GPS https://pubs.acs.org/doi/abs/10.1021/acs.jpcb.5b02165. Factoring numbers www.purplemath.com/modules/factnumb.htm. FASTQ story https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/. Fredi and Barton https://www.amazon.in/Algorithm-Design-Pearson-New-International-ebook/dp/B00IZ0FXUE. Hacking DNA https://techcrunch.com/2017/08/09/malicious-code-written-into-dna-infects-the-computer-that-reads-it/. http://genetics.hpi.uni-hamburg.de/FOUNTAIN.html. https://www.explainthatstuff.com/quantum-computing.html. Ignatova, Z., Martinez-Perez, I., 2008. DNA Computing Models. Springer Publishing. Illumina FASTQ https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html. Learning https://www.edureka.co/blog/deep-learning-tutorial. Leonard Adleman, Encyclopedia Britannica, retrieved 2015-11-24. Mail Order for CRISPR https://www.scientificamerican.com/article/mail-order-crispr-kits-allow-absolutely-anyone-to-hack-dna/. Namasudra, S., Chandra Deka, G., September 2018. Advances of DNA Computing in Cryptography. CRC Press.
Predicting crime with DNN https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0176244. Quantum Computers comparison https://www.nextbigfuture.com/2015/12/google-finds-dwave-quantumannealer-is.html. Quantum Annealing https://www.researchgate.net/publication/51117464_Quantum_annealing_with_manufactured_spins. The DNA writer http://montessorimuddle.org/2013/02/02/dna-writer/. Wikipedia Contributors, 2018. Bioinformatics: Sequence Alignment, Synthetic Biology, Whole Genome Sequencing, FASTQ Format, FASTA Format. Focus On Publishing. Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture, https://www.researchgate.net/publication/314195927_DNA_Fountain_enables_a_robust_and_efficient_storage_architecture.
Chapter 8 Sequencing DNA-encoded data
Someday in the next thirty years, very quietly one day we will cease to be the brightest things on Earth. – James McAlear
Today we are on the cusp of an epoch-making transition, from being passive observers of Nature to being active choreographers of Nature. This is the central message of Visions. – Michio Kaku
Computers will become so powerful and widespread that the surface of the earth becomes a "living" membrane, endowed with planetary "intelligence," creating the fabled Magic Mirror featured often in fairy tales. ... Further, we will "know the mind of God" when we finalize the equations to the theory of everything. – Michio Kaku
Systems develop goals of their own the instant they come into being. –John Gall
The grandiose design of our digital universe in the 21st century

Connectivity is the mainstream of our civilization. The Internet, our digital universe (shown in a 3D isometric view in Fig. 8.1), is expanding every second, and no one can stop this big bang. It is an incredible gravitational force that is pulling us into becoming more creative and resilient. We like the word "universe" because it brings depth and inquisitiveness. The 21st century is going to be a game changer, and whoever controls the paradigm will be the wealthiest leader. Bill Gates controlled the software world with his Windows, and Jeff Bezos absorbed retail e-commerce with his experimentation skills. Bezos is not Einstein, but he cultivated a "predictive analysis" module in his frontal lobe, which gave him formidable "look ahead" and "what if" features. Let us go back and talk about the digital universe. Our culture, generally in the United States, dictates the rule that "nothing is going to happen until something takes place and pushes it over." We are talking about the explosion of data that is drowning the world. According to Computer Business Review (www.cbronline.com/big-data/85-of-data-is-useless-to-business-and-is-creating-a-33-trillion-drain-on-resources-4840818/), 85% of data were created with no value to business. Most of these data get stored permanently until they get
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00008-2 Copyright © 2020 Elsevier Inc. All rights reserved.
FIGURE 8.1 Soft landing or crash landing, the result will be the same: information storage will have to move to a greener space, something limitless, cheaper, and safe until the end of time. DNA is our storage promised land. The process flow is indicated by the numbers.
destroyed. The dark data, which have an unknown value and are redundant, obsolete, or trivial, are proving to be a huge burden to organizations that have hoarded data without processes in place to categorize and understand what they are keeping hold of. You save 100 pennies, and you go to the bank to collect one dollar; in the case of data storage, however, organizations are reluctant to spend the time and labor to extract knowledge at the right time, or when newer data get stored, so they resort to throwing the data in the attic and leaving them to rot forever. Some things are unmeasurable: first, because they are dynamic; second, because we do not have the tools to measure them; and third, because our measured data become obsolete the moment we stop measuring. The volume of ocean water, the size of the universe, the size of the Internet, and the size of our storage are all examples. But what drives the growth of our stored data? First, our affinity for the Internet and the ease of access to it. Second, the declining price of electronic storage. Third, our migration to the online environment. Fourth, and most importantly, our lack of awareness of any storage shortage. Fifth, our migration to audio and video media.
Smart City ontology

An ontology is a formal representation of a body of knowledge within a given domain, which in our case is the Smart City (SC). Ontologies usually consist of sets of interrelated terms about a particular subject. Here, we are going to talk about the ontology of the SC, which is
FIGURE 8.2 The monolithic representation of a “total everything-to-everything” Smart City, incorporating information and communication technologies (ICT) to enhance the quality and performance of urban services. DNA services are prevalent in storing crucial infrastructure operating information, as well as offering a medical records reference to hospitals, DNA fingerprinting, blockchain transactions data, and financial and government records.
structured like an organization chart: each term has a specific function (attribute) and is connected, either up or down, to another term, say a supervisor or a subordinate. Fig. 8.2 is the detailed SC ontology (organization chart) with all the related functions.
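An organization-chart-style ontology like this maps naturally onto a tree data structure. The sketch below is a toy illustration in Python; the class and the Smart City term names are hypothetical, not taken from Fig. 8.2.

```python
# A minimal sketch of an organization-chart-style ontology:
# each term has an attribute (its function) and an optional parent (supervisor).
class Term:
    def __init__(self, name, function, parent=None):
        self.name = name
        self.function = function
        self.parent = parent           # supervisor term, or None for the root
        self.children = []             # subordinate terms
        if parent is not None:
            parent.children.append(self)

    def chain_to_root(self):
        """Walk 'up' the ontology from this term to the root."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

# Hypothetical Smart City fragment (names are illustrative only).
city = Term("SmartCity", "root domain")
transport = Term("Transport", "urban mobility services", parent=city)
transit = Term("PublicTransit", "bus and rail scheduling", parent=transport)

print(transit.chain_to_root())   # ['PublicTransit', 'Transport', 'SmartCity']
```

Walking a term's chain to the root is exactly the "supervisor" lookup described above; walking `children` gives the subordinates.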
The bright sun of DNA is coming up

While the discovery of DNA was among the most significant of the 20th century, it will continue to revolutionize medicine, agriculture, forensics, paternity testing, and, lately, information storage. DNA computing encompasses an evolving area of progress and continues to enable new discoveries in the domains of bioinformatics, AI (artificial intelligence)-centric cybersecurity, DNA knowledge archiving, and data storage. It is evident that one discovery will lead to another and will inspire innovation in areas that only a short while ago were considered science fiction. Dr. Michio Kaku elaborated on this in his book Physics of the Impossible: "One day, would it be possible to walk through walls? To build starships that can travel faster than the speed of light? To
read other people's minds? To become invisible? To move objects with the power of our minds? To transport our bodies instantly through outer space?" In the November 21, 2018 issue of New Scientist magazine, there is an impressive article titled "37 Trillion Pieces of You: The Plan to Map the Entire Human Body." It talks about a concept that has never been discussed in the past: Everything we do, from moving to thinking, digesting food to sleeping, depends on a vast array of different cells: disc-like red blood cells, spindly nerve cells, stretched-out cells that make our muscles; the list goes on. These specialized units come together to form our tissues and organs and make us the complex organisms we are. And yet so many of them remain mysterious. Now, for the first time, we are set to make a comprehensive inventory, an aim every bit as ambitious as the Human Genome Project that decoded our DNA. The Human Cell Atlas project plans to identify and locate every type of cell we possess, and so revolutionize our understanding of the body in the same way that the first atlases transformed our view of the world. The first results are already showing how this can help us find our way around healthy bodies, not to mention finding routes to new treatments for conditions like cancer that occur when cells turn bad. While the Human Cell Atlas (HCA) project is moving by leaps and bounds, another group of bioinformaticians and biologists are putting their heads together to solve the problem of information gluttony in the world. One overwhelming example: "Tutu," by Miles Davis, was one of the first songs ever encoded in DNA, and it was recovered with 100% accuracy. If successful, DNA storage could be the answer to a uniquely 21st-century problem: information overload. Five years ago, humans had produced 4.4 zettabytes (1 ZB = 10^21 bytes) of data; that is set to explode to 160 zettabytes (each year!) by 2025.
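To make the idea concrete, here is a minimal sketch of the simplest binary-to-DNA mapping (two bits per base). This textbook scheme is for illustration only; it is not necessarily how "Tutu" was actually encoded, and real systems such as DNA Fountain layer screening and redundancy on top of the core mapping.

```python
# A minimal sketch of the textbook binary-to-DNA mapping (2 bits per base:
# A=00, C=01, G=10, T=11). Real storage systems add error correction and
# avoid problematic runs of bases; this shows only the core idea.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {b: s for s, b in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    bits = "".join(BASE_TO_BITS[b] for b in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"Tutu")
assert decode(strand) == b"Tutu"     # lossless round trip
print(strand)                        # CCCACTCCCTCACTCC

# Why the density claims are plausible: assuming 2 bits per nucleotide and
# ~330 g/mol per single-stranded nucleotide, one gram of DNA holds roughly
bytes_per_gram = (6.022e23 / 330) * 2 / 8
print(f"{bytes_per_gram:.1e} bytes/g")   # ~4.6e20, i.e. hundreds of exabytes
```

Each byte becomes exactly four bases, and the back-of-the-envelope figure at the end shows why a sugar cube's worth of DNA can plausibly hold staggering amounts of data.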
Current infrastructure can handle only a fraction of the coming data deluge, which is expected to consume all the world's microchip-grade silicon by 2040. Most digital archives, from music to satellite images to research files, are currently saved on magnetic tape. Tape is cheap, but it takes up space, and it has to be replaced roughly every 10 years. "Today's technology is already close to the physical limits of scaling," says Victor Zhirnov, chief scientist of the Semiconductor Research Corporation. "DNA has an information-storage density several orders of magnitude higher than any other known storage technology." How dense exactly? Imagine formatting every movie ever made into DNA; the result would be smaller than a sugar cube. And it would last for 10,000 years.

Sequencing standard created by Illumina, Inc.

We would like to start with some key terms that are used very frequently in this section:
• Binary base call (BCL)
• Next-generation sequencing (NGS)
• FASTQ, a text-based sequencing data file format
• BaseSpace sequencing (BSS)
• Sequence alignment/map (SAM)
• Illumina sequencing data access
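Since FASTQ appears throughout this chapter, a short sketch of its four-line record structure may help. Each record holds an "@" header, the base sequence, a "+" separator, and a quality string whose characters encode Phred scores (score = ASCII code minus 33 in the Sanger/Illumina 1.8+ convention). The reader below is a minimal illustration, not a production parser.

```python
# A minimal FASTQ reader: each record is four lines —
# '@' header, the base sequence, a '+' separator, and an ASCII-encoded
# quality string (Phred score = ord(char) - 33 for Illumina 1.8+).
def parse_fastq(text):
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _, qual = lines[i:i + 4]
        yield {
            "id": header[1:],                      # drop the leading '@'
            "seq": seq,
            "phred": [ord(c) - 33 for c in qual],  # per-base quality scores
        }

sample = """@read1
ACGTACGT
+
IIIIFFFF"""

for rec in parse_fastq(sample):
    print(rec["id"], rec["seq"], rec["phred"])
```

For the sample read, the 'I' characters decode to Phred 40 (an error probability of 1 in 10,000) and the 'F' characters to Phred 37.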
How to retrieve your Illumina Solexa sequencing data

The most practical way to get the sequenced files back is through file transfer protocol (FTP) programs. When the sequencing and postprocessing of your Illumina Solexa sequencing statement of work is
FIGURE 8.3 Our FTP server is located at ftp://ftp.bcgsc.ca/. We recommend the following FTP programs to access your data: for the Windows environment, there are many FTP clients available, and WS_FTP is the best; it can be purchased from www.ipswitch.com. The WS_FTP software was acquired from TechSmith Company.
complete, we make the data available to you, protected by a username and password, on our FTP server. We will email you this username and password when the data are available, along with the path to your data files on our FTP server. The data will remain on the server for only 4 weeks, so it is important to retrieve your data in a timely manner. Fig. 8.3 shows a sample of the FTP software, Ipswitch WS_FTP, which we use to transfer data reliably from a user laptop to a central server.
Mac OS X You can use Cyberduck as a freely available FTP program. Use the Open Connection button, and then your connection screen should look like that shown in Fig. 8.4.
Linux (and Mac OS X) users can use command-line FTP clients. Generally, a command-line FTP session with our servers will look as shown in Fig. 8.5.
FIGURE 8.4 Note that on the Mac, Fugu is a popular and robust SFTP client, but this does not currently work with our server as we are not running SFTP. Software WS-FTP was acquired from TechSmith Company.
FIGURE 8.5 A screen shot from our FTP program connected to Canada's Genome Sciences Centre (ftp.bcgsc.ca) to retrieve binary data.
What is the Illumina method of DNA sequencing?

Illumina sequencing has been used to sequence many genomes and has enabled the comparison of DNA sequences to improve our understanding of health and disease. Illumina sequencing generates many millions of highly accurate reads, making it much faster and cheaper than other available sequencing methods.
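The library-preparation step described next (breaking the DNA into fragments of roughly 200 to 600 base pairs) can be pictured with a toy sketch; the fragment sizes and the input string here are purely illustrative.

```python
import random

# Toy sketch of the library-prep fragmentation step: cut a long DNA string
# into pieces of roughly 200-600 bases. Real fragmentation is physical or
# enzymatic and random; a seeded RNG keeps this sketch reproducible.
def fragment(dna, lo=200, hi=600, seed=42):
    rng = random.Random(seed)
    frags, i = [], 0
    while i < len(dna):
        size = rng.randint(lo, hi)
        frags.append(dna[i:i + size])
        i += size
    return frags

genome = "ACGT" * 1000            # a 4000-base stand-in for real DNA
pieces = fragment(genome)
print(len(pieces), [len(p) for p in pieces[:3]])
assert "".join(pieces) == genome  # no bases lost or duplicated
```

Every fragment except possibly the last falls in the 200-600 base window, and concatenating the fragments recovers the original molecule.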
Illumina DNA sequencing operations

The first step in this sequencing technique is to break up the DNA into more manageable fragments of around 200 to 600 base pairs. Fig. 8.6 shows Illumina sequencing machines in the sequencing lab at the Sanger Institute. Short sequences of DNA called adaptors are attached to the DNA fragments. The DNA fragments attached to adaptors are then made single stranded by incubating the fragments with sodium hydroxide. Once prepared, the DNA fragments are washed across the flowcell. The complementary DNA binds to primers on the surface of the flowcell, and DNA that does not attach is washed away. The DNA attached to the flowcell is then replicated to form small clusters of DNA with the same sequence. When sequenced, each cluster of DNA molecules will emit a signal that is strong enough to be detected by a camera. Unlabeled nucleotide bases and DNA polymerase are then added to lengthen and join the strands of DNA attached to the flowcell, creating "bridges" of double-stranded DNA between the primers on the flowcell surface. The double-stranded DNA is then broken down into single-stranded DNA using heat, leaving several million dense clusters of identical DNA sequences. Primers and fluorescently labeled terminators (terminators are versions of the nucleotide bases A, C, G, or T that stop DNA synthesis) are
FIGURE 8.6 Courtesy of Illumina; https://biotech.unl.edu/technology-illumina-solexa-sequencing; image credit Genome Research Limited.
FIGURE 8.7 The central dogma of molecular biology explains the flow of genetic information from DNA to RNA to a functional product, a protein.
added to the flowcell. The primer attaches to the DNA being sequenced. The DNA polymerase then binds to the primer and adds the first fluorescently labeled terminator to the new DNA strand. Once a base has been added, no more bases can be added to the strand of DNA until the terminator base is cut from the DNA. One of the most important types of information contained in DNA is the set of instructions for how to build various proteins, as proteins do most of the work in cells. To follow these instructions, a cell must first copy a gene into a form of RNA known as precursor messenger RNA (pre-mRNA); this process is called transcription. After being processed, the RNA (now called mRNA, or mature mRNA) is ready for translation into a protein that can carry out the instructions in the gene. Fig. 8.7 illustrates this central dogma. Lasers are passed over the flowcell to activate the fluorescent label on the nucleotide base. This fluorescence is detected by a camera and recorded on a computer. Each of the terminator bases (A, C, G, and T) gives off a different color. The fluorescently labeled terminator group is then removed from the first base, and the next fluorescently labeled terminator base can be added alongside. And so the process continues until millions of clusters have been sequenced. The DNA sequence is analyzed base-by-base during Illumina sequencing, making it a highly accurate method. The sequence generated can then be aligned to a reference sequence to look for matches or changes in the sequenced DNA.
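Two of the processes just described can be sketched in a few lines. The first maps per-cycle fluorescence colors back to bases (the color-to-base assignment here is illustrative, not Illumina's actual dye chemistry); the second follows the DNA-to-RNA-to-protein flow of Fig. 8.7 with a deliberately tiny excerpt of the standard codon table.

```python
# Part 1: toy base calling. Each sequencing cycle yields one fluorescence
# color per cluster; mapping colors to bases, cycle by cycle, rebuilds reads.
COLOR_TO_BASE = {"red": "A", "green": "C", "blue": "G", "yellow": "T"}

def call_bases(cycles):
    """cycles[i] lists the color each cluster emitted during cycle i."""
    reads = ["" for _ in cycles[0]]
    for colors in cycles:
        for cluster, color in enumerate(colors):
            reads[cluster] += COLOR_TO_BASE[color]
    return reads

cycles = [["red", "blue"], ["green", "blue"], ["yellow", "red"]]
print(call_bases(cycles))          # ['ACT', 'GGA']

# Part 2: the central dogma of Fig. 8.7 — transcription (T -> U on the
# coding strand) and translation via a truncated standard codon table.
def transcribe(dna):
    return dna.replace("T", "U")

CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def translate(mrna):
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGCTAA")
print(mrna, translate(mrna))       # AUGUUUGGCUAA ['Met', 'Phe', 'Gly']
```

The four codons included (AUG, UUU, GGC, UAA) are taken from the standard genetic code; a real translator would carry all 64.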
FIGURE 8.8 In the few years after the end of the Human Genome Project, the cost of genome sequencing roughly followed Moore's law, which predicts an exponential decline in computing costs. After 2007, sequencing costs dropped precipitously. MERIT CyberSecurity Consulting DNA Engineering.
The disruptive industry of digital DNA sequencing

Jeff Bezos' meteoric success is captured in his quote: "In today's era of volatility, there is no other way but to re-invent. The only sustainable advantage you can have over others is agility, that's it. Because nothing else is sustainable, everything else you create, somebody else will replicate." Fig. 8.8 shows how the price of sequencing shot down like a missile.
From cell atlas project to DNA storage libraries

Organic and digital DNA are merging. Now that scientists and technologists have access to DNA engineering, they have been using DNA conventions to store books, recordings, GIFs, and even an Amazon gift card. This new DNA identity is a quantum leap in the way we can look at new challenges (and the opportunities coming out of these increasingly challenging operations), with eventual cybersecurity and much bigger 360-degree implications. In one proof-of-concept attack, researchers encoded malware into synthetic DNA strands; when those strands were sequenced, the malware launched and compromised the computer that was analyzing the sequences, allowing the team to take control of it and, of course, manipulate it. If you search Google for directories, there are 2500 types of directories: yellow and white pages, CDs, mag tapes, websites, periodicals. They are mines of information for their users. We all use
them to get the right answer at the right time. Many of the directories are cash cows, but information seekers do not mind paying exorbitant retrieval fees. Some of the directories have very sophisticated data-mining engines, such as the real-estate directories. Amazon's e-commerce AI engine will come back and suggest new products and prices that fit the budget. In the next few years, a new paradigm will shift cybersecurity down to the molecular level; when DNA becomes the de facto medium of information storage, a new generation of directories will be invented and programmed into AI nanobots. And the antivirus technology (AVT) vendors will wake up to the shocking reality of DNA and start focusing on DNA binary synthesis and sequencing. DNA will take on another responsibility and therefore be proclaimed the "hard disk of human life." An example of a holistic ID would be the registering of all human identities' DNA in distributed ledgers run by smart contracts and operating in an integrated way, with machine learning enabled by IoT devices and sensors. The challenge is, as humans, will we be able to manage all these ID data sources, and will we be able to keep the necessary compliance and ethics? A new method provides an "off-the-shelf" way to perform a DNA analysis using a smartphone-sized device called the MinION. "The MinION reads the DNA in real-time, and a DNA read that comes off can be interpreted right away," Sophie Zaaijer, a researcher who led the project, told Digital Trends. For further information about Sophie Zaaijer, see https://www.eurekalert.org/pub_releases/2017-11/cuso-nsc112917.php.
Blockchained cannabis DNA

A new type of DNA crime has emerged that will give law enforcement agencies an uphill battle and demonstrate their inability to catch up with this type of sinister technology. It is called the "blockchained cannabis use of DNA sequencing," used to create a transparent supply chain that cultivators, dispensaries, regulators, and consumers can trust. Blockchains (BCs) are global, transparent, hack-proof ledgers that are ideally suited for storing data that are valuable to many different collaborative, and potentially competitive, parties. As a result, BCs present immutable ledgers that anyone in the world can view and everyone can verify. The result will be a blockchain-based genetic catalog of all cannabis varieties, coded in DNA sequence, which provides transparency to an opaque world of underground names and folklore medicine. This new "hybrid" technology has the unique ability, in time, to build the most comprehensive cannabis tracking system in the world, one that will bring patient safety, manufacturer transparency, and regulatory comfort. The BC system is attractive to underground lords, because they can commit their evil under a shroud of transparency without being noticed. Furthermore, BC can trace any problems during the transfer of money or information among members and will alert them when and how a problem happened. BC-linked mobile phone applications could instantly verify legitimate material with QR code (bar code) links to AI tracers on Kannapedia.net. It will be a cannabis tracking system that will bring safety to malware traders, dealer transparency, and ways to spoof evidence for the law. Cannabis breeders and cultivators, even drug dealers and human traffickers, can use DNA genetic sequencing to accurately fingerprint their activities and publish their genetic data to the Dash BC (digital cash). They can then use the information to defend against any future surprises related to law enforcement or patents on their products. Fig.
8.9 illustrates the sequencing process.
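The fingerprint-and-publish workflow can be sketched as hashing a cultivar's sequence and appending the digest to a ledger. Everything below (the function names, the strain name, the plain-list "ledger") is a hypothetical stand-in for a real blockchain interface such as Dash.

```python
import hashlib
import json
import time

# Sketch of publishing a cultivar's DNA fingerprint to a ledger: hash the
# sequence, then record the digest in an append-only list standing in for
# a real public blockchain. Names and structure are illustrative only.
def fingerprint(sequence: str) -> str:
    return hashlib.sha256(sequence.encode()).hexdigest()

ledger = []   # stand-in for a public, append-only chain

def publish(strain_name, sequence):
    entry = {
        "strain": strain_name,
        "fingerprint": fingerprint(sequence),
        "timestamp": time.time(),
    }
    ledger.append(json.dumps(entry))
    return entry

rec = publish("HypotheticalStrain-1", "ACGTACGTGGCCTTAA")

# Anyone can later re-sequence a sample and verify it against the digest:
assert fingerprint("ACGTACGTGGCCTTAA") == rec["fingerprint"]
```

Because SHA-256 changes completely for even a one-base difference, a published digest lets any third party check a sample against the registered genetics without the registrant revealing anything else.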
FIGURE 8.9 DNA sequencing is the process of determining the exact sequence of nucleotides (base pairs, BP) within a DNA molecule. MERIT CyberSecurity Consulting DNA Engineering.
Medicinal Genomics Corporation is a unique company that has created a profitable and needed niche: helping cannabis growers, producers, distributors, safety-testing labs, healthcare facilities, and even individual buyers check the quality of medicinal cannabis; in other words, the whole supply chain. Medicinal Genomics utilizes a highly sophisticated next-generation sequencing laboratory, a bioinformatics system, and DNA-based technologies to deliver unmatched technical solutions to decipher the genetic code and quality level of medicinal cannabis. The company sells life science tools to third-party testing laboratories established to ensure the quality and safety of medicinal cannabis. This is how Medicinal Genomics leveraged DNA sequencing and BC technology to create a transparent supply chain that cultivators, dispensaries, regulators, and consumers can trust. Amazing ingenuity!
The hidden second code in our DNA

The discovery of a secret second code hiding within DNA, one that instructs cells on how genes are controlled, changed our understanding of DNA and revealed that our genetic code is even more complex than previously thought. When the genetic code was deciphered, scientists assumed that it only described how proteins are made. However, the revelation made by the research team led by John Stamatoyannopoulos of the University of Washington showed that genomes use the genetic code to write two separate languages! "For over 40 years we have assumed that DNA changes affecting the genetic code solely impact how proteins are made. ... Now we know that this basic assumption about reading the human genome missed half of the picture." Scientists discovered that the second language instructs the cells on how genes are controlled.
For reference, see https://www.ellines.com/en/achievements/11589-the-researcher-who-discovered-double-meaning-in-dna/.
Professor Paul Davies, from the Australian Centre for Astrobiology at Macquarie University in Sydney, put forward an unusual and thought-provoking idea: there could be an alien message, written in binary code, hidden in our DNA. Instead of leaving artifacts for humans to find once they are sufficiently evolved, an advanced extraterrestrial civilization might incorporate information into the human genome, allowing it to be copied and maintained over immense periods of time. The downside of leaving behind alien artifacts is that they would not survive for millions of years. A coded message hidden in our DNA, on the other hand, could be preserved for a very long time, and it would only be discovered once the human race had the technology to read and understand it. Our DNA is remarkable in so many ways, and this brings us to the big question: who is the creator of our genetic code? That is a question no one can answer. Some would say our DNA is simply a coincidence, a wonderful work of Mother Nature. People who believe in God will say our DNA offers evidence of a Cosmic Creator. Then, there are those who think an advanced extraterrestrial civilization is responsible for producing our genetic code. Mystery of our coded DNA: who was the "programmer"? Does our DNA offer evidence of a cosmic creator? Maxim A. Makukov, of the Fesenkov Astrophysical Institute, and Vladimir I. Shcherbak, from the al-Farabi Kazakh National University, spent 13 years working on the Human Genome Project. According to Makukov and Shcherbak, humans were designed by a higher power, with a "set of arithmetic patterns and ideographic symbolic language" encoded into our DNA; they believe our species was designed by a higher genetic code from alien life forms, reflected in the 97% of noncoding sequences in human DNA.
Whatever the truth may be, there is no doubt that our DNA is amazing, and our unique genetic code raises many questions about our existence and role in the universe.
Get on the A-train for blockchain

A BC is a digital concept for storing data. The data come in blocks, so imagine blocks of digital data. These blocks are chained together, and this makes the data immutable. When a block of data is chained to the other blocks, its data can never be changed again. It will be publicly available to anyone who wants to see it, ever again, in exactly the way it was once added to the BC. That is quite revolutionary, because it allows us to keep track records of pretty much anything we can think of (to name some: property rights, identities, money balances, medical records), without being at risk of someone tampering with those records. If I buy a house right now and add a photo of the property rights to a BC, I will always and forever be able to prove that I owned those rights at that point. Nobody can change that information once it is put on the BC. So, it is a way to save data and make it immutable.
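The chaining-makes-it-immutable idea can be demonstrated in a few lines. This is a toy hash-chained list, not a real distributed blockchain (it has no consensus, mining, or peers); the property it shows is that editing any earlier block visibly breaks every later link.

```python
import hashlib
import json

# Toy hash-chained ledger: each block stores the hash of its predecessor,
# so altering any earlier block changes that block's hash and breaks the
# chain from that point on.
def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def add_block(chain, data):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"data": data, "prev_hash": prev})

def is_valid(chain):
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
add_block(chain, "deed: house -> Alice")
add_block(chain, "deed photo: abc123")
assert is_valid(chain)

chain[0]["data"] = "deed: house -> Mallory"   # tamper with history...
assert not is_valid(chain)                     # ...and the chain breaks
```

On a real blockchain the same broken-link check is performed independently by every node, which is why tampering cannot go unnoticed.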
The sunrise

BC was invented by Satoshi Nakamoto in 2008 (there is still doubt about his real identity) to serve as the public transaction ledger of the cryptocurrency bitcoin. The invention of the BC for bitcoin made it the first digital currency to solve the double-spending problem without the need for a trusted authority or central server. In a 2011 article in The New Yorker, Joshua Davis claimed to have narrowed down the identity of Nakamoto to several possible individuals, including the Finnish economic sociologist Dr.
Villi Lehdonvirta and Irish student Michael Clear, then a graduate student in cryptography at Trinity College Dublin and now a postdoctoral student at Georgetown University. Michael Clear strongly denied he was Nakamoto, as did Lehdonvirta. For reference, see https://observer.com/2011/10/did-the-new-yorkers-joshua-davis-nail-the-identity-of-bitcoin-creator-satoshi-nakamoto/. To this day, no one knows who the real Nakamoto is. The real person must have used this fake name to stay in hiding. In 2017, an article published by a former SpaceX intern considered the possibility of SpaceX and Tesla CEO Elon Musk being the real Satoshi Nakamoto, based on Musk's technical expertise with financial software and his history of publishing white papers. However, in a tweet on November 28, 2017, Musk denied the claim. So much for this charade. For further information, see https://www.scmp.com/tech/leaders-founders/article/2122100/elon-musk-denies-he-mysterious-bitcoin-creator-satoshi. In 2008, Satoshi Nakamoto released the whitepaper "Bitcoin: A Peer-to-Peer Electronic Cash System," which described bitcoin and the BC, the technology that runs bitcoin. Nakamoto's masterpiece became, over the past decade, one of the biggest ground-breaking technologies, with the potential to impact every industry from finance to manufacturing to educational institutions. By analogy, we can say that BC is to bitcoin what the Internet is to email. Cryptocurrency is the blood that runs throughout the system. Dubai's prestigious magazine Esharat, in February 2018, published the article "How Blockchain is changing the future of Dubai and giving the green light to deploy Blockchain citywide."
Unseen sinkholes

There are landmines and violent convulsions in the journey toward maturity of any foundational technology, and blockchain is on the same bumpy trajectory. Look at the Internet: it started in 1962 with three terminals, and today 63% of the world population (7.7 billion as of December 2018) have access to it. Radio took 38 years to reach an audience of 50 million; television took 13 years; the Internet took just 4 years. Malware has been the steady "invisible" shadow of the Internet, described as the deep web, which is not indexed or documented by any agency or ISP. The statistics of Internet malware are mind-boggling and growing at an exponential rate. We can say that the Internet has its own dynamic entropy to fill the whole universe. Some do believe blockchain is overhyped (Gartner is one of the consulting firms that understands the depth of the problem) and overcelebrated. Our point of view is that this is how any technology grows toward mainstream adoption: you will hear a lot of positives, and you will hear a lot of negatives. The irrational exuberance phase has had two demonstrable effects: evolution and hucksterism. Blockchain has been used successfully in cryptocurrencies, but the technology is still not "enterprise ready." In 2017, we saw some evolution on that front as BC platforms such as Hyperledger Fabric announced new versions closer to enterprise use, and Ethereum progressed toward making these solutions perform and scale to suit enterprise needs. The US Senator Thomas Carper eloquently said, "Virtual currencies, perhaps most notably Bitcoin, have captured the imagination of some, struck fear among others, and confused the heck out of the rest of us." For further information, see https://www.macrobusiness.com.au/category/australian-dollar-2/page/38/.
Blockchain's competition: Ethereum

BC, like any highly demanded object in life, stirred competition, and now a dozen start-up companies have thrown their hats in the ring, ready to market a replacement for BC. Here is what Don Tapscott, cofounder and Executive Chairman of the Blockchain Research Institute, said about Ethereum: "Ethereum blockchain has some extraordinary capabilities. One of them is that you can build smart contracts. It's kind of what it sounds like. It's a contract that self-executes, and the contract handles the enforcement, the management, performance, and payment." Ethereum is an open software platform based on BC technology that enables developers to build and deploy decentralized applications. Like bitcoin, Ethereum is a distributed public BC network. Although there are some significant technical differences between the two, the most important distinction to note is that bitcoin and Ethereum differ substantially in purpose and capability. Bitcoin offers one particular application of BC technology: a peer-to-peer (P-2-P) electronic cash system that enables online bitcoin payments. While the bitcoin BC is used to track ownership of digital currency (bitcoins), the Ethereum BC focuses on running the programming code of any decentralized application. In the Ethereum BC, instead of mining for bitcoin, miners work to earn Ether, a type of crypto token that fuels the network. Beyond being a tradeable cryptocurrency, Ether is also used by application developers to pay for transaction fees and services on the Ethereum network.
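Tapscott's description of a contract that "self-executes ... and handles the enforcement, the management, performance, and payment" can be illustrated with a toy escrow state machine. This is plain Python expressing the concept, not Solidity or actual Ethereum code, and all of the names below are hypothetical.

```python
# A toy, self-executing escrow "contract": the code itself enforces the
# payment rules, with no third party involved. Conceptual sketch only —
# not Solidity, and not how Ethereum represents contract state.
class EscrowContract:
    def __init__(self, buyer, seller, price):
        self.buyer, self.seller, self.price = buyer, seller, price
        self.deposited = 0
        self.state = "AWAITING_PAYMENT"

    def deposit(self, amount):
        # The contract refuses anything but the exact agreed price.
        assert self.state == "AWAITING_PAYMENT" and amount == self.price
        self.deposited = amount
        self.state = "AWAITING_DELIVERY"

    def confirm_delivery(self):
        # On delivery, the contract itself releases the funds to the seller.
        assert self.state == "AWAITING_DELIVERY"
        payout, self.deposited = self.deposited, 0
        self.state = "COMPLETE"
        return (self.seller, payout)

c = EscrowContract("alice", "bob", 100)
c.deposit(100)
print(c.confirm_delivery())   # ('bob', 100)
```

The point of the sketch is the state machine: once deployed, every transition is enforced by the code, which is exactly what makes real smart contracts powerful and, as the next section argues, dangerous when the code is wrong.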
Disadvantages of blockchain

Political independence: A major argument for BC is to create systems outside government control, closely related to the focus on removing a "third party." In a libertarian universe, government and government actors are always bad. Smart contracts eliminate third parties, e.g., lawyers, notaries, banks, insurance companies. The concept of "self-sovereign identity" (BC-based identity management) again seeks to remove government as the prover of identity.

Smart contracts: The term was coined by Nick Szabo in 1996: "A smart contract is a set of promises, specified in digital form, including protocols within which the parties perform on these promises." It was extended by Ian Grigg as the "Ricardian contract": "A digital contract that defines the terms and conditions of an interaction, between two or more peers, that is cryptographically signed and verified." Importantly, it is both human and machine readable and digitally signed. "The ultimate test of our mission is if the legal profession can take a Ricardian contract and unambiguously decide points of dispute," said Ian Grigg, inventor of the Ricardian digital currency contract. Such contracts have been tested in court successfully; cf. DigiGold v. Systemics, before the Supreme Court of Anguilla (2001). The idea was included in Ethereum by Vitalik Buterin.

Smart contracts in Ethereum: Smart contracts (supposedly) are pieces of code codifying agreements and trust relations, deployed on a virtual machine (VM) to be automatically executed by the VM. In the BC + smart contract world, SCs will control high-value assets; SCs will be unchangeable, autonomous, and unstoppable; publicly visible and analyzable; run in a public, hostile environment; and written by fallible human beings.

Smart contracts and the real world: "A smart contract is a piece of code which is stored on a blockchain, triggered by blockchain transactions, and which reads and writes data in that blockchain's database." –Gideon Greenspan.
See https://www.multichain.com/blog/2016/04/beware-impossible-smart-contract/, March 16, 2018.

Failures of smart contracts: There are many high-profile cases of smart contracts failing. 2016, most famously "The DAO": a smart contract running a virtual company obtained $150 million of funding in ETH and lost $60 million due to a bug in the code. 2017, the Parity multisig (multisignature) wallet: a bug resulted in the loss of $30 million. 2017, the Parity multisig wallet again: $300 million frozen (and lost), largely due to coding errors. This is to be expected in a "move fast and break things" culture.

Technical issue 1: A BC platform is a peer-to-peer network of nodes. The nodes collaborate to reach consensus on changes to the database. In Ethereum (for example), the state of the database is the state of a "world computer" programmable via smart contracts. Assumption: a smart contract will execute as specified. Reality: various mechanisms produce different results, e.g., reordering of transaction orders (intentionally) by miners, or unexpected consequences of transactions (e.g., with the decentralized autonomous organization (DAO), allowing reinvocation). The Solidity language: compilation produces unexpected results; it is seriously broken. See https://news.ycombinator.com/item?id=14691212; https://goo.gl/r7qKvc.

Technical issue 2: External data. Typical examples of smart contracts are an index-based agricultural insurance policy or a bank transaction. The smart contract has to run on every node in the BC. The nodes all query an oracle (a weather service, a bank server) and expect to get the same data, but there is no guarantee: the oracle may change, or the oracle may be inaccessible. A smart contract responding to external data is not deterministic (it cannot always give the same result). A solution here is a "trusted third party" that queries the oracle.

Technical issue 3: Semantics. Smart contracts exist in multiple layers, from human intention through to CPU instruction.
Each layer needs syntax and semantics; semantics specify the meaning of concepts and the mapping to the real world. The real world impinges on the “blockchain world”: the meaning of concepts changes, and the real world changes. Conflict of semantics leads to real-world failure; classic examples are the Mars Climate Orbiter (1998), a hospital that (digitally) killed patients (Michigan, 2003), and nuclear attack early warning systems (repeatedly). Proper semantics means formal vocabularies (ontologies) to systematize descriptions of the world.

Philosophical/political issues: Philosophically, using smart contracts implies that we can describe the world, or part of the world, perfectly; that this part of the world will not change; that we can release the SC (smart contract) onto a BC (blockchain) to run forever; and that there will be no mistakes in the code or in the representation of the world. Logically, this denies much that we know about the world: humans are fallible, and the world changes. We usually like to have democratic/political control of processes, so we need to be able to revise a smart contract (SC), change it, and adapt it to reality and human needs. (Smart contracts and the real world, March 16, 2018. For further information, see https://www.arijuels.com/press/.)

Technical issue 4: Toward robustness. Strategies for more robust smart contracts: best practices (risk analysis, security requirements, attack modeling, code audits); design patterns (the Gang of Four; Gamma et al.) for SC ownership, data provider authentication, and transfer of funds; static analysis tools (many are being developed, especially for Ethereum’s Solidity language, but also for EVM bytecode analysis); and formal verification, i.e., formal proofs that code is correct, which needs to be done at the various layers mentioned. Use better, more rigorous languages for smart contracts, e.g., Tezos with the Michelson language (and many others); use languages that do not surprise you. For further information, see https://en.wikipedia.org/wiki/Design_Patterns.
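The oracle problem from Technical issue 2 can be illustrated with a small simulation, written here in JavaScript for brevity. This is not Ethereum code; the function names and numbers are hypothetical. It shows why nodes that each query an external service independently can compute divergent states, and why having a trusted party inject the oracle value once, as transaction data, restores determinism:

```javascript
// Illustrative simulation, not real Ethereum code: three nodes replay the
// same contract logic for an index-based crop insurance policy.
// All names and numbers here are hypothetical.

// The contract's deterministic rule: pay out if rainfall is below threshold.
function settleInsurance(rainfallMm, thresholdMm) {
  return rainfallMm < thresholdMm ? "payout" : "no-payout";
}

// Case 1: every node queries the weather oracle on its own. The oracle may
// answer differently per call (it updated between calls, or one request
// failed), so the nodes compute different states and consensus breaks.
const readingsSeenByNodes = [18, 18, 23]; // mm of rain, one reading per node
const divergent = readingsSeenByNodes.map(r => settleInsurance(r, 20));
// divergent -> ["payout", "payout", "no-payout"]

// Case 2: a trusted third party queries the oracle once and injects the
// value into the chain as transaction data; every node replays that same
// input, so execution is deterministic again.
const injectedReading = 18; // a single recorded value, seen by all nodes
const agreed = readingsSeenByNodes.map(() => settleInsurance(injectedReading, 20));
// agreed -> ["payout", "payout", "payout"]
```

The point of the sketch is that determinism depends on the input entering the chain exactly once, not on the contract logic itself.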
Chapter 8 Sequencing DNA-encoded data
Why blockchain (BC) uses semantics: Most current blockchain applications have very simple semantic data models that study the meaning of the concepts tied to the reality of blockchain. If the meaning of the concepts changes, then the real world changes. Proper blockchain semantics means formal vocabularies (ontologies) to systematize descriptions of the blockchain world. Conflict of semantics leads to real-world failure; classic examples are the Mars Climate Orbiter (1998), a hospital that (digitally) killed patients (Michigan, 2003), and nuclear attack early warning systems. Everledger, for example, launched a blockchain platform to ensure transparency in diamond sourcing using standards, vocabularies, and ontologies.
Malware is hovering over DNA code

Biological and medical science experienced a real breakthrough when the DNA structure was decoded in 1953. DNA is a macromolecule that stores and transfers essential information about living beings, i.e., the genetic code. Francis Crick and James Watson, the scientists who described the double helix of this biopolymer, received the 1962 Nobel Prize, 9 years after the discovery. They did not know at that time that their achievement would become the object of cybertheft half a century later. Ma Bell (the phone company) monopolized communication, and data processing ran the business authoritatively, tyrannically, and intimidatingly. The user community was hostage and had no access to hardware. You had “what you see is what you get” and no control over software. When the Internet got out of the bottle and became the Abraham Lincoln of the community, the data processing influence transformed into a network of servers serving the emancipated users. CIOs are caught between a rock and a hard place. We all know that CIO stands for “Career Is Over.” The wag who coined that acronym was undoubtedly referring to the burnout factor that comes with the job and the consequent short tenure of the average CIO. We should admit that this complex “Chief ... Officer” title, which was created in the 1980s, became glamorous and ended up being a job termed “Career Is Over”: a job blown out of proportion, with a home stretch to expulsion flavored with insanity. Technology gives and technology takes. I think the canonical rule of new systems applies: “New systems generate new problems ... Complex systems exhibit unexpected behavior.” The Internet was a great invention, and 3.8 billion users are hooked on it by the hour. The Internet was the right technology to help start-up companies become influential game changers. We can list a few of them, such as Google, Facebook, Amazon, Apple, Instagram, Twitter, and Ethereum.
But let us not forget the dark side of technology, which gave malware designers golden opportunities to create incredible malware, ransomware, and viruses, and from which emerged the most innovative malware vectors, services, and breaching attacks against social media, banks, and healthcare institutions. In just 1 year, the sale of ransomware on the dark Web grew more than 2500%, meaning cybercrime has become a game everyone wants to play. According to Carbon Black, a leading provider of next-generation endpoint security, in its 2017 “Ransomware Economy” report, which names some of the most stinging ransomware bear traps, such as CryptoLocker, GoldenEye, Locky, and WannaCry: “It’s no secret that 2017 is shaping up to be the most notorious year on record for ransomware. Even a casual news consumer can identify several, if not all, of the menacing ransomware attacks that have cost worldwide businesses an estimated $1 billion this year.” For further information, see https://www.carbonblack.com/company/news/press-releases/dark-web-ransomware-economy-growing-annual-rate-2500-carbon-black-research-finds/.
Take, for example, Equifax. The massive cyberattack that rattled the Atlanta-based consumer credit reporting agency in July 2017 affected nearly 145.5 million people globally. Because the company waited until September to notify customers of the breach, lawmakers pushed for legislative reform, and several senior-level executives and C-suite officers stepped down amid suspicious share-selling activity. The company said it spent $87.5 million in the third quarter on recovery efforts.
Case 1: blockchained malware inside DNA

The leak of DNA or a patient’s case history is more serious than it may appear at first glance. DNA can unveil a person’s genealogy and diseases, or even predict how many years a person will live. It is the largest and most vulnerable data storage. Here is the catch: BC technology arguably may be one of the most significant inventions in computer science in 50 years. But now cyber malware lords have come up with a diabolical scheme to go molecular under the DNA umbrella and compromise DNA and digital computers. It follows the distributed denial-of-service (DDoS) architecture, but it uses BC as the rail for DNA-coded traffic. Here, briefly, is how it works: (1) the attack is converted to a binary file; (2) the file is sent to a network of zombie synthesizers that convert the attack into DNA code; (3) the DNA-encoded payload is sequenced by DNA sequencers and validated for 100% accuracy; (4) each node on the BC decodes the payload into binary before it is fired into the target system. No antivirus software will know how to neutralize the massive distributed attack. DNA computers will suffer greater damage when the DNA-coded payload reaches the system. Fig. 8.10 demonstrates the malware leveraged with BC. Fig. 8.11 shows the gene structure, which malware professionals know how to obstruct and replace with a stinging payload.
FIGURE 8.10 This is the architecture of the new generation of blockchain, which will be used by the malware attackers and distributors. It starts as a conventional DDoS and terminates as a DNA sequence before injecting the payload. DDoS, distributed denial of service.
FIGURE 8.11 Gene organization. The transcriptional unit produces the RNA molecule and is defined by the transcription start site (TC) and stop site (tC). Within the transcriptional unit lies the coding sequence, from the translation start site (TL) to the stop site (tL). The upstream regulatory region may have controlling elements such as enhancers or operators in addition to the promoter (P), which is the RNA polymerase-binding site. A stop site for transcription (tC) is also required. From the TC start to the tC stop is sometimes called the transcriptional unit, that is, the DNA region that is copied into RNA. Within this transcriptional unit, genes have several important regions: a promoter is necessary for RNA polymerase binding, the transcription start and stop sites define the transcriptional unit, and there may be regulatory sites for translation, namely a start site (TL) and a stop signal (tL). Other sequences involved in the control of gene expression may be present either upstream or downstream from the gene itself. MERIT CyberSecurity Consulting DNA Engineering.
Case 2: biohacking (malware hidden in DNA)

On March 27, 2020, the seventh biohacking conference will take place in Los Angeles. Biohacking is basically tinkering with body chemistry by hobbyists and advanced hackers who are bored with conventional DDoS, credit card fraud, Stuxnet, and all of the present generation of TCP/IP-driven symmetrical cyberattacks. Fig. 8.12 gives the visual malware delivery scenario. DNA is going to be the most viable launch pad for better and more discreet “shortcuts” for performance enhancement in all sports events. Furthermore, biohacks promise anything from quick weight loss to enhanced brain function. Now, biohackers are looking at a bright future by stepping into the domain of DNA and altering the gene sequence to create an avant-garde portfolio of invisible, deadly prescriptions of genetic code loaded with malware. Biologically, a virus is a piece of RNA, which is an intermediary storage for genetic code that temporarily duplicates a piece of the DNA (the permanent storage of genetic information in a cell). RNA then goes through some “engines,” which can duplicate it and/or convert it into proteins (genetic code is really a blueprint for proteins). Proteins are the active molecules that do all the work to keep a cell “alive.” The virus is sufficiently small to enter some cells, where it hijacks the replicator engine, which makes other copies of the virus, by which the attack spreads.
FIGURE 8.12 This is how biohackers use CRISPR-Cas9, in the sequence indicated in the figure, to cut a healthy DNA strand and replace it with a contaminated DNA segment. The numbered flow indicates the steps of the operation.
The adverse effect of the virus comes from the fact that while the replicator is busy with photocopying the virus, it does not process the “normal” RNA, which comes from the cell’s own genetic code. Protein production is thus slowed down or stopped altogether, and the cell ceases to function properly (or at all). Effects on the host body depend on what kind of cells are affected, and how much it hijacks the replicators.
Case 3: DNA malware trafficking

This is a modern war using DNA as a weapon to damage the lives of patients, citizens, research labs, and health institutions.
DNA satellite hacking

The rapidly expanding number of satellites transmitting GPS locations, cellphone signals, and other sensitive information is creating new opportunities for hackers. The risk is exacerbated by the growing number of aging satellite systems in circulation. While it is cheaper to leave old satellites in orbit than to pull them from space, the outdated systems are even easier targets for hacking. Fig. 8.13 shows the six phases of DNA malware trafficking and how smart the malware crafted by cyborg criminals has become.
FIGURE 8.13 In the next few years, satellites, especially the old ones, will be a strategic launching pad for malware against DNA binary data sequencing. Malware vectors, in binary format, will be sent to crime-based clouds and the Dark Web for packaging before they reach, through private networks, a network of synthesizers and sequencers. Malware vectors will be hidden in DNA code and will reach their target.
DNA drone hacking

Drone technology has been advancing exponentially and is becoming truly disruptive due to its futuristic advancement. Advances have been focusing on commercial drones, order deliveries, photogrammetry and remote sensing, and military applications as well. Drones are referred to as unmanned aerial vehicles (UAVs). The advancements introduced by UAVs are innumerable and have led the way for the full integration of UAVs, as intelligent objects, into the Internet of Things (IoT). Drones are programmed robots that can perform several tasks while they are flying. High-precision cameras have been used for GPS and surveying applications. The biggest advantage of AI-centric drones is that they are easily programmable (a dual 1 GHz Linux companion computer), relatively silent, and amazingly precise in reaching their targets. Ironically, they have been used in the assassination of political adversaries. Fig. 8.14 demonstrates the ingenuity of the new DNA attacks that use drone technology to hack DNA sequencing and supplements the visual scenario of drone hacking.
FIGURE 8.14 Illustration of how drone technology is being used to provide the secret supply chain of compromised DNA sequences to hospitals through the Deep Web.
Appendices

Appendix 8.A The cannabis dilemma

Cannabis, or marijuana, is a psychoactive drug from the Cannabis plant that can be used for medical or recreational purposes. Cannabis can be used by smoking, vaporizing, within food, or as an extract. Cannabis has mental and physical effects, such as creating a “high” or “stoned” feeling, a general change in perception, heightened mood, and an increase in appetite. The effect can last for 2-6 h. Short-term side effects may include a decrease in short-term memory, dry mouth, impaired motor skills, red eyes, and feelings of paranoia or anxiety. Long-term side effects may include addiction, decreased mental ability in those who started as teenagers, and behavioral problems in children whose mothers used cannabis during pregnancy. Cannabis is mostly used for recreation or as a medicinal drug, although it may also be used for spiritual purposes. In 2013, 128-232 million people used cannabis (2.7% of the global population between the ages of 15 and 65 years). It is the most commonly used illegal drug, although in many states it has become legal; California was the first state to legalize medical marijuana, back in 1996, and became even more pot-friendly in 2016 when it made it legal to use and carry up to an ounce of marijuana. The countries with the highest use among adults as of 2018 are Zambia, the United States, Canada, and Nigeria. As of 2018, 51% of people in the United States had never used cannabis; about 12% had used it in the past year, and 7.3% had used it in the past month. The United Nations’ World Drug Report stated that cannabis “was the world’s most widely produced, trafficked, and consumed drug in the world in 2010” and estimated between 128 million and 238 million users globally in 2015.
Appendix 8.B Dr. Church’s Regenesis book
DNA can be used to store information at a density about a million times greater than your hard drive, as reported by science researchers today. George Church of Harvard Medical School and colleagues report that they have written an entire book in DNA, a feat that highlights the recent advances in DNA synthesis and sequencing. The team encoded a draft HTML version of a book cowritten by Church called Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves. In addition to the text, the biological bits included the information for modern formatting, images, and JavaScript to show that “DNA (like other digital media) can encode executable directives for digital machines,” they write. See www.regenesisthebook.com/. To do this, the authors converted the computational language of 0’s and 1’s into the language of DNA, whose nucleotides are typically represented by A’s, T’s, G’s, and C’s: the A’s and C’s took the place of the 0’s, and the T’s and G’s of the 1’s. They then used off-the-shelf DNA synthesizers to make 54,898 pieces of DNA, each 159 nucleotides long, to encode the book, which could then be decoded with DNA sequencing. This is not the first time nonbiological information has been stored in DNA, but Church’s demonstration goes far beyond the amount of information stored in previous efforts. For example, in 2009, researchers encoded 1688 bits of text, music, and imagery in DNA, and in 2010, Craig Venter and colleagues encoded a watermarked, synthetic genome worth 7920 bits. In this study, Church and company stored 5.27 megabits of data. DNA synthesis and sequencing are still too slow and costly to be practical for most data storage, but the authors suggest DNA’s long-lived nature could make it a suitable medium for archival storage. Erik Winfree, who studied DNA-based computation at Caltech and was a 1999 TR35 winner, hopes the study will stimulate a serious discussion about what roles DNA can play in information science and technology.
“The most remarkable thing about DNA is its information density, which is roughly one bit per cubic nanometer,” he writes in an email. “Technology changes things, and many old ideas for DNA information storage and information processing deserve to be revisited now, especially since DNA synthesis and sequencing technology will continue their remarkable advance.”
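The bit-to-base mapping described above (0 written as A or C, 1 as G or T) can be sketched in a few lines of JavaScript. This is an illustrative reconstruction, not the authors' actual pipeline; picking whichever candidate base differs from the previous one is one simple way to avoid homopolymer runs, which sequence poorly:

```javascript
// Sketch of the one-bit-per-base scheme described above: 0 is written as
// A or C, 1 as G or T. Choosing whichever candidate base differs from the
// previous base avoids homopolymer runs (AAAA...), which sequence poorly.
function bitsToDna(bits) {
  let dna = "";
  for (const bit of bits) {
    const choices = bit === "0" ? ["A", "C"] : ["G", "T"];
    dna += choices[0] === dna.slice(-1) ? choices[1] : choices[0];
  }
  return dna;
}

// Decoding ignores which of the two bases was chosen: A/C -> 0, G/T -> 1.
function dnaToBits(dna) {
  return [...dna].map(base => ("AC".includes(base) ? "0" : "1")).join("");
}

const bits = "0110100001101001"; // the ASCII bytes for "hi"
const strand = bitsToDna(bits);
// dnaToBits(strand) recovers the original bits exactly, regardless of
// which of the two candidate bases was emitted at each position.
```

Because two bases share each bit value, the encoder has freedom to shape the DNA sequence while the decoder remains unambiguous.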
Appendix 8.C The miracle of making protein

Fig. 8.15 is a visual representation of how DNA nucleotides go through the mRNA assembly line and end up as amino acids before reaching the end of the line as protein. Fig. 8.16 is a guide to the most used numbers in DNA engineering; it is a little challenging to remember all the Greek terms that are used to represent complex exponential numbers.
Appendix 8.D Unit conversion table

How do you convert text to binary code using JavaScript? Convert every character, using the charCodeAt function, to get the ASCII code in decimal. Then, you can convert it to a binary value using toString(2):
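A minimal sketch of that description follows; the inverse function, binaryToText, is added here only to show the round trip and is not part of the original description:

```javascript
// Text -> binary, as described above: charCodeAt(0) yields the character
// code in decimal, and toString(2) renders it in base 2. padStart keeps
// every character at a fixed 8-bit width so the bytes can be split apart.
function textToBinary(text) {
  return [...text]
    .map(ch => ch.charCodeAt(0).toString(2).padStart(8, "0"))
    .join(" ");
}

// The inverse (added for the round trip): parseInt(byte, 2) turns each
// 8-bit group back into a character code.
function binaryToText(binary) {
  return binary
    .split(" ")
    .map(byte => String.fromCharCode(parseInt(byte, 2)))
    .join("");
}

// textToBinary("Hi") -> "01001000 01101001"
```

Each character becomes one 8-bit byte, which is the form that can then be mapped onto DNA bases.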
MERIT CyberSecurity Consulting DNA Engineering. Another interactive GUI program is shown in Fig. 8.17.
FIGURE 8.15 Messenger RNA (mRNA) is a molecule in cells that carries codes (base pairs) from DNA to the ribosome assembly machine, where the base pairs are converted to tertiary blocks. Ribosomes link amino acids together in the order specified by messenger RNA (mRNA) molecules. Translation is accomplished by the ribosome, which goes through the codon table and generates proteins. © 2014 Copyright (MERIT CyberSecurity Group); All rights are reserved.
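The codon-table lookup described in the caption can be sketched as follows. The table below is a small subset of the standard genetic code (the full table has 64 codons), and the function name is illustrative:

```javascript
// A small subset of the standard codon table; the full table has 64 entries.
const CODON_TABLE = {
  AUG: "Met", // methionine, also the start codon
  GCU: "Ala", GCC: "Ala",
  UGG: "Trp",
  UUU: "Phe",
  AAA: "Lys",
  UAA: "Stop", UAG: "Stop", UGA: "Stop",
};

// Walk the mRNA three bases at a time, as the ribosome does, and stop
// translating when a stop codon (or an unknown codon) is reached.
function translate(mrna) {
  const peptide = [];
  for (let i = 0; i + 3 <= mrna.length; i += 3) {
    const aminoAcid = CODON_TABLE[mrna.slice(i, i + 3)];
    if (aminoAcid === "Stop" || aminoAcid === undefined) break;
    peptide.push(aminoAcid);
  }
  return peptide;
}

// translate("AUGGCUUUUUAA") -> ["Met", "Ala", "Phe"]
```

The lookup mirrors the figure: triplets of mRNA bases index into the codon table, and the resulting amino acid chain is the protein.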
FIGURE 8.16 Mathematicians love numbers loaded with exponents, but readers prefer a unique symbol that corresponds to each complex number. Pick your choice, and use your preferred figure. MERIT CyberSecurity Consulting DNA Engineering.
FIGURE 8.17 It is not easy to learn the ASCII-to-binary table by heart. Interestingly enough, the 21 characters of “Hello, this is a test” were translated into 21 binary bytes, which are easier to convert to a DNA sequence. MERIT CyberSecurity Consulting DNA Engineering.
How can we encode the digital data into DNA? To encode digital data into DNA, that is, in an A, T, G, C manner, we need to carry out DNA sequencing (to know the sequence of each base) with different types of DNA sequencers. Different types of DNA sequencers are available, such as the following:
• Pacific Biosciences
• Ion Torrent
• Pyrosequencing
• Illumina sequencing
• Ultimately the latest method, i.e., nanopore sequencing
Sanger sequencing is one of the most accurate methods, but it is time-consuming; the other types briefed above are costly but quick, efficient methods. For this encoding, bioinformatics tools are also utilized. Different user-friendly, algorithm-based tools work with these aligned, sequenced DNA reads or short oligonucleotides, displaying them on a PC or laptop screen.
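One simple mapping often used in textbook descriptions of DNA data storage assigns two bits per base (A = 00, C = 01, G = 10, T = 11). The sketch below is an illustration under that assumption, not the scheme of any particular sequencer or storage system; real schemes add error correction and avoid problem sequences:

```javascript
// Illustrative two-bits-per-base mapping: A=00, C=01, G=10, T=11.
const BASES = ["A", "C", "G", "T"];

function bytesToDna(bytes) {
  let dna = "";
  for (const byte of bytes) {
    // Emit the four 2-bit pairs of each byte, most significant pair first.
    for (let shift = 6; shift >= 0; shift -= 2) {
      dna += BASES[(byte >> shift) & 0b11];
    }
  }
  return dna;
}

function dnaToBytes(dna) {
  const bytes = [];
  for (let i = 0; i < dna.length; i += 4) {
    let byte = 0;
    for (const base of dna.slice(i, i + 4)) {
      byte = (byte << 2) | BASES.indexOf(base);
    }
    bytes.push(byte);
  }
  return bytes;
}

// "A" is byte 65 = 01000001 -> pairs 01 00 00 01 -> "CAAC"
// bytesToDna([65]) -> "CAAC"
```

At two bits per base, every byte of digital data occupies exactly four nucleotides, which is why DNA's information density is so attractive for storage.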
Appendix 8.E Glossary for blockchain (From MERIT CyberSecurity knowledge base)

16S ribosomal DNA (rDNA): The 16S rRNA is a structural component of the bacterial ribosome (part of the 30S small subunit). The 16S rDNA is the gene that encodes this RNA molecule. Owing to its essential role in protein synthesis, this gene is highly conserved across all prokaryotes. There are portions of the 16S gene that are extremely highly conserved, so that a single set of “universal” PCR primers can be used to amplify a portion of this gene from nearly all prokaryotes. The gene also contains variable regions that can be used for taxonomic identification of bacteria. Amplification and taxonomic assignment of 16S rDNA sequences is a widely used method for metagenomic analysis. Adapters: Exogenous nucleic acids that are ligated to a nucleic acid molecule to be sequenced. For example, SMRTbell adapters are hairpin loops that are ligated to both ends of the double-stranded DNA insert to produce a SMRTbell sequencing template. When adapter sequences are removed from a CCS read, the read is split into multiple subreads. Algorithm: A step-by-step method for solving a problem (a recipe). In bioinformatics, it is a set of well-defined instructions for making calculations. The algorithm can then be expressed as a set of computer instructions in any software language and implemented as a program on any computer platform. Alignment: See Sequence alignment. Alignment algorithm: See Sequence alignment. Allele: In genetics, an allele is an alternative form of a gene, such as blue versus brown eye color. However, in genome sequencing, an allele is one form of a sequence variant that occurs in any position on any chromosome, or a sequence variant on any sequence read aligned to the genome, regardless of its effect on phenotype, or even if it is in a gene.
In some cases, “allele” is used interchangeably with the term “genotype.” Amplicon: An amplicon is a specific fragment or locus of DNA from a target organism (or organisms), generally 200-1000 bp in length, copied millions of times by the polymerase chain reaction (PCR). Amplicons for a single target (i.e., a reaction with a single pair of PCR primers) can be
prepared from a mixed population of DNA templates such as HIV particles extracted from a patient’s blood or total bacterial DNA isolated from a medical or an environmental sample. The resulting deep sequencing provides detailed information about the variants at the target locus across the population of different DNA templates. Amplicons produced from many different PCR primers on many different DNA samples can be combined (with the aid of multiplex barcodes) into a single DNA sequencing reaction on an NGS machine. Assemble: See Sequence assembly. Assembly: See Sequence assembly. BAM file: BAM is a binary sequence file format that uses BGZF compression and indexing. BAM is the binary compressed version of the SAM (sequence alignment/map) format, which contains information about each sequence read in an NGS dataset with respect to its alignment position on a reference genome, variants in the read versus the reference genome, mapping quality, and the sequence quality string in an ASCII string that represents PHRED quality scores. BED file: BED is an extremely simple text file format that lists positions on a reference genome with respect to chromosome ID and start and stop positions. NGS reads can be represented in BED format, but only with respect to their position on the reference genome; no information about sequence variants or base quality is stored in the BED file. BLAST: The basic local alignment search tool was developed by Altschul and other bioinformaticians at the NCBI to provide an efficient method for scientists to use similarity-based searching to locate sequences in the GenBank database. BLAST uses a heuristic algorithm based on a hash table of the database to accelerate similarity searches, but it is not guaranteed to find the optimal alignment between any two sequences. BLAST is generally considered to be the most widely used bioinformatics software.
Burrows-Wheeler transformation (BWT): A method of indexing (and compressing) a reference genome into a graph data structure of overlapping substrings, known as a suffix tree. It requires a single computational effort to build this graph for a particular reference genome, and then it can be stored and reused when mapping multiple NGS datasets to this genome. The BWT method is particularly efficient when the data contain runs of repeated sequences, as in eukaryotic genomes, because it reduces the complexity of the genome by collapsing all copies of repeated strings. BWT works well for alignment of NGS reads to a reference genome because the sequence reads generally match the reference perfectly or with few mismatches. BWT methods work poorly when many mismatches and indels are present in the reads, because many alternate paths through the suffix tree must be mapped. Highly cited NGS alignment software that makes use of BWT includes BWA, Bowtie, and SOAP2. Capillary DNA sequencing: This is a method used in DNA sequencing machines manufactured by Life Technologies Applied Biosystems. The technology is a modification of Sanger sequencing that contains several innovations: the use of fluorescently labeled dye terminators (or dye primers), cycle sequencing chemistry, and electrophoresis of each sample in a single capillary tube containing a polyacrylamide gel. High voltage is applied to the capillaries, causing the DNA fragments produced by the cycle sequencing reaction to move through the polymer and separate by size. Fragment sizes are determined by a fluorescent detector, and the bases that comprise the sequence of each sample are called automatically. ChIP-seq: Chromatin immunoprecipitation sequencing uses NGS to identify fragments of DNA bound by specific proteins such as transcription factors and modified histone subunits. Tissue samples or cultured cells are treated with formaldehyde, which creates covalent cross-links between DNA and
associated proteins. The DNA is purified and fragmented into short segments of 200-300 bp and then immunoprecipitated with a specific antibody. The cross-links are removed, and the DNA segments are sequenced on an NGS machine (usually Illumina). The sequence reads are aligned to a reference genome, and protein-binding sites are identified as sites on the genome with clusters of aligned reads. Circular consensus (CCS) read: The consensus sequence determined using subreads taken from a single ZMW. This is not aligned against a reference sequence. In contrast to reads of insert, CCS reads require at least two full-pass subreads from the insert. Cloning: In the context of DNA sequencing, DNA cloning refers to the isolation of a single purified fragment of DNA from the genome of a target organism and the production of millions of copies of this DNA fragment. The fragment is usually inserted into a cloning vector, such as a plasmid, to form a recombinant DNA molecule, which can then be amplified in bacterial cells. Cloning requires significant time and hands-on laboratory work and creates a bottleneck for traditional Sanger sequencing projects. Consensus sequence: When two or more DNA sequences are aligned, the overlapping portions can be combined to create a single consensus sequence. In positions where all overlapping sequences have the same base (a single column of the multiple alignment), that base becomes the consensus. Various rules may be used to generate the consensus for positions where there are disagreements among overlapping sequences. A simple majority rule uses the most common letter in the column as the consensus. Any position where there is disagreement among aligned bases can be written as the letter N to designate “unknown.” There is also a set of IUPAC ambiguity codes (YRWSKMDVHB) that can be used to specify specific sets of different DNA bases that may occupy a single position in the consensus.
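The simple majority rule described in the Consensus sequence entry can be sketched as follows; positions where the top counts tie are written as N, as the entry suggests for disagreements. The function name is illustrative:

```javascript
// Majority-rule consensus over a set of aligned, equal-length reads.
// Positions where the two most common bases tie are written as N.
function consensus(alignedReads) {
  const length = alignedReads[0].length;
  let result = "";
  for (let i = 0; i < length; i++) {
    // Count how often each base appears in this column of the alignment.
    const counts = {};
    for (const read of alignedReads) {
      counts[read[i]] = (counts[read[i]] || 0) + 1;
    }
    const ranked = Object.entries(counts).sort((a, b) => b[1] - a[1]);
    const tied = ranked.length > 1 && ranked[0][1] === ranked[1][1];
    result += tied ? "N" : ranked[0][0];
  }
  return result;
}

// consensus(["GATT", "GCTT", "GATA"]) -> "GATT"
// (column 2: A,C,A -> A by majority; column 4: T,T,A -> T by majority)
```

A production assembler would weight each vote by base quality rather than counting reads equally, but the column-by-column majority vote is the core idea.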
Contig: A contiguous stretch of DNA sequence that is the result of assembly of multiple overlapping sequence reads into a single consensus sequence. A contig requires a complete tiling set of overlapping sequence reads spanning a genomic region without gaps. Coverage: The number of sequence reads in a sequencing project that align to positions that overlap a specific base on a target genome or the average number of aligned reads that overlap all positions on the target genome. de Bruijn graph: This is a graph theory method for assembling a long sequence (like a genome) from overlapping fragments (like sequence reads). The de Bruijn graph is a set of unique substrings (words) of a fixed length (a k-mer) that contain all possible words in the dataset exactly once. For genome assembly, the sequence reads are split into all possible k-mers, and overlapping k-mers are linked by edges in the graph. Reads are then mapped onto the graph of overlapping k-mers in a single pass, greatly reducing the computational complexity of genome assembly. See De novo sequencing. De novo sequencing: The sequencing of the genome of a new, previously unsequenced organism or DNA segment. This term is also used whenever a genome (or sequence dataset) is assembled by methods of sequence overlap without the use of a known reference sequence. De novo sequencing might be used for a region of a known genome that has significant mutations and/or structural variation from the reference. Diploid: A cell or organism that contains two copies of every chromosome, one inherited from each parent. DNA fragment: A small piece of DNA, often produced by a physical or chemical shearing of larger DNA molecules. NGS machines determine the sequence of many DNA fragments simultaneously.
Exon: A portion of a gene that is transcribed and spliced to form the final messenger RNA (mRNA). Exons contain protein-coding sequence and untranslated upstream and downstream regions (3′ UTR and 5′ UTR). Exons are separated by introns, which are sequences that are transcribed by RNA polymerase but spliced out after transcription and not included in the mature mRNA. FASTA format: This is a simple text format for DNA and protein sequence files developed by William Pearson in conjunction with his FASTA alignment software. The file has a single header line that begins with a “>” symbol followed by a sequence identifier. Any other text on the first line is also considered the header, and any text following the first carriage return/line feed is considered part of the sequence. Multiple sequences can be stored in the same text file by adding additional header lines and sequences after the end of the first sequence. FASTQ file: A text file format for NGS reads that contains both the DNA sequence and quality information about each base. Each sequence read is represented as a header line with a unique identifier for each sequence read and a line of DNA bases represented as text (GATC), which is very similar to the FASTA format. A second pair of lines is also present for each read: another header line and then a line with a string of ASCII symbols, equal in length to the number of bases in the read, which encode the PHRED quality score for each base. Fragment assembly: To determine the complete sequence of a genome or large DNA fragment, short sequence reads must be merged. In Sanger sequencing projects, overlaps between sequence reads are found and aligned by similarity methods, and then consensus sequences are generated and used to create contigs. Eventually a complete tiling of contigs is assembled across the target DNA. In NGS, there are too many sequence reads to search for overlaps among them all (a problem with exponential complexity).
Alternate algorithms have been developed for de novo assembly of NGS reads, such as de Bruijn digraphs, which map all reads to a common matrix of short k-mer sequences (a problem with linear complexity). GenBank: The international archive of DNA and protein sequence data maintained by the National Center for Biotechnology Information (NCBI), a division of the U.S. National Library of Medicine. GenBank is part of a larger set of online scientific databases maintained by the NCBI, which includes the PubMed online database of published scientific literature, gene expression, sequence variants, taxonomy, chemicals, human genetics, and many software tools to work with these data. Heterozygote: Humans and most other eukaryotes are diploid, meaning that they carry two copies of each chromosome in every somatic cell. Therefore, each individual carries two copies of each gene, one inherited from each parent. If the two copies of the gene are different (i.e., different alleles of that gene), then the person is said to be a heterozygote for that gene. A homozygote has two identical copies of that gene. In genome sequencing, every base of every chromosome can be considered as a separate data point; thus, any single base can be genotyped as heterozygous or homozygous in that individual. High-performance computing (HPC): HPC provides computational resources to enable work on challenging problems that are beyond the capacity and capability of desktop computing resources. Such large resources include powerful supercomputers with massive numbers of processing cores that can be used to run high-end parallel applications. HPC designs are heterogeneous but generally include multicore processors, multiple CPUs within a single computing device or node, graphics processing units (GPUs), and multiple nodes grouped in a cluster interconnected by high-speed networking systems. The most powerful current supercomputers can perform several quadrillion (1015) operations per second (petaflops). 
Trends in supercomputing architecture are toward greater miniaturization of parallel processing units, which saves energy (and reduces heat), speeds message passing, and allows access to data in shared memory caches.

Histone: In eukaryotic cells, the DNA in chromosomes is organized and protected by wrapping around a set of scaffold proteins called histones. The four core histones (H2A, H2B, H3, and H4) assemble, two copies of each, into an octamer that forms a spool structure; the linker histones H1 and H5 bind the DNA between spools. DNA winds around the histone core about 1.65 times, using a length of 147 bp, to form a unit known as the nucleosome. Methylation and other modifications of the histone proteins affect the structure and function of DNA (epigenetics).

Human Genome Project (HGP): An international effort, including 20 sequencing centers in China, France, Germany, Great Britain, Japan, and the United States, coordinated by the US Department of Energy and the National Institutes of Health, to sequence the entire human genome. The effort formally began in 1990 with the allocation of funds by Congress and the development of high-resolution genetic maps of all human chromosomes. The project was formally completed in two stages: the "working draft" genome in 2000 and the "finished" genome in 2003. The 2003 version of the genome was declared to have fewer than one error per 10,000 bases (99.99% accuracy), an average contig size of >27 million bases, and coverage of 99% of the gene-containing regions of all chromosomes. In addition, the HGP was responsible for large improvements in DNA sequencing technology, the mapping of more than 3 million human SNPs, and genome sequences for Escherichia coli, the fruit fly, and other model organisms.
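As a concrete illustration of the FASTA format defined above, a minimal parser can be sketched in a few lines of Python. The function name and sample records below are invented for illustration; real projects would normally use an established library such as Biopython:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):            # header line starts a new record
            name = line[1:].split()[0]      # the identifier is the first token
            records[name] = []
        elif name is not None and line:
            records[name].append(line)      # a sequence may span many lines
    return {k: "".join(v) for k, v in records.items()}

sample = ">seq1 demo record\nGATTACA\nGATC\n>seq2\nTTTT\n"
print(parse_fasta(sample))  # {'seq1': 'GATTACAGATC', 'seq2': 'TTTT'}
```

Note how the parser implements exactly the rules of the definition: everything after ">" on the header line belongs to the header, and everything until the next header line belongs to the sequence.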
Human Microbiome Project: An effort coordinated by the US National Institutes of Health to profile the microbes (bacteria and viruses) associated with the human body: first to inventory the microbes present at various locations inside and outside the body and the normal range of variation in healthy people, and then to investigate changes in these microbial populations associated with disease.

Illumina sequencing: The NGS method developed by the Solexa company and then acquired by Illumina Inc. This method uses "sequencing by synthesis" chemistry to simultaneously sequence millions of ~300-bp-long DNA template molecules. Many sample preparation protocols are supported by Illumina, including whole-genome sequencing (by random shearing of genomic DNA), RNA sequencing, and sequencing of fragments captured by hybridization to specific oligonucleotide baits. Illumina has aggressively improved its system through many updates, at each stage generally providing the highest total yield and the greatest yield of sequence per dollar among commercially available DNA sequencers, leading to a dominant share of the NGS market. Machines sold by Illumina include the Genome Analyzer (GA, GAII, GAIIx), HiSeq, and MiSeq. At various times, with various protocols, Illumina machines have produced NGS reads of 25, 36, 50, 75, 100, and 150 bp, as well as paired-end reads.

Indels: Insertions or deletions in one DNA sequence with respect to another. Indels may be a product of errors in DNA sequencing, the result of alignment errors, or true mutations in one sequence with respect to another, such as mutations in the DNA of one patient with respect to the reference genome. In the context of NGS, indels are detected in sequence reads after alignment to a reference genome. Indels are called in a sample (i.e., a patient's genome) after variant detection has established a high probability that the indel is present in multiple reads with adequate coverage and quality, and is not the result of errors in sequencing or alignment.

Intron: A portion of a gene that is spliced out of the primary transcript and not included in the final messenger RNA (mRNA). Introns separate exons, which contain the protein-coding portions of a gene.
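The k-mer decomposition that underlies the de Bruijn graph assembly approach (see the Fragment assembly entry above) is also easy to sketch. The helper names below are invented for illustration; a real assembler would additionally count k-mer multiplicities and traverse the graph:

```python
def kmers(read, k):
    """Decompose a read into its overlapping k-mers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def debruijn_edges(reads, k):
    """Each k-mer contributes one edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = []
    for read in reads:
        for km in kmers(read, k):
            edges.append((km[:-1], km[1:]))
    return edges

print(kmers("GATTACA", 3))          # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
print(debruijn_edges(["GATT"], 3))  # [('GA', 'AT'), ('AT', 'TT')]
```

Because every read is reduced to the same fixed vocabulary of k-mers, reads that overlap automatically share graph nodes, which is what lets assembly scale linearly instead of requiring an all-against-all overlap search.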
ktup, k-tuple, or k-mer: A short word composed of DNA symbols (GATC) that is used as an element of an algorithm. A sequence read can be broken down into shorter segments of text (either overlapping or nonoverlapping words); the length of the word is called the ktup size. Very fast exact-matching methods can find words that are shared by multiple sequence reads or between sequence reads and a reference genome. Word-matching methods can use hash tables and other data structures that computer software can manipulate much more efficiently than sequence reads represented as long text strings.

Mate-pair sequencing: Similar to paired-end sequencing; however, the DNA fragments used as sequencing templates are much longer (1,000–10,000 bp). To accommodate these long template fragments on NGS platforms such as Illumina, additional sample preparation steps are required. Linkers are added to the ends of the long fragments, and then the fragments are circularized. The circular molecules are then sheared to generate new DNA fragments at a size appropriate for construction of sequencing libraries (200–300 bp). From this set of sheared fragments, only those containing the added linkers are selected; these selected fragments contain both ends of the original long fragment. New primers are added to both ends, and standard paired-end sequencing is performed. The orientation of the paired sequence reads after mapping to the genome is opposite that of the standard paired-end method (outward facing rather than inward facing). Mate-pair methods are particularly valuable for joining contigs in de novo sequencing and for detecting translocations and large deletions.

Metagenomics: The study of complete microbial populations in environmental and medical samples. Often conducted as a taxonomic survey using direct PCR (with universal 16S primers) of DNA extracted from environmental samples.
Shotgun metagenomics sequences all DNA in these samples and then attempts both taxonomic and functional identification of the genes encoded by microbial DNA.

Microarray: A collection of specific oligonucleotide probes organized in a grid pattern of microscopic spots attached to a solid surface, such as a glass slide. The probes contain sequences from known genes. Microarrays are generally used to study gene expression by hybridizing labeled RNA extracted from an experimental sample to the array and then measuring the intensity of signal in each spot. Microarrays can also be used for genotyping by creating an array of probes that match alternate alleles of specific sequence variants.

Movie: Real-time observation of an SMRT cell.

Multiple alignment: A computational method that lines up three or more sequences (of DNA, RNA, or proteins) as a set of rows of text to maximize the identity of overlapping positions while minimizing mismatches and gaps. The resulting set of aligned sequences is also known as a multiple alignment. Multiple alignments may be used to study evolutionary information about the conservation of bases at specific positions in the same gene across different organisms, or about the conservation of regulatory motifs across a set of genes. In NGS, multiple alignment methods are used to reduce a set of overlapping reads that have been mapped to a region of a reference genome by pairwise alignment to a single consensus sequence, and also to aid in the de novo assembly of novel genomes from sets of overlapping reads created by fragment assembly methods.

Next-generation (DNA) sequencing (NGS): DNA sequencing technologies that simultaneously determine the sequence of DNA bases from many thousands (or millions) of DNA templates in a single biochemical reaction volume. Each template molecule is affixed to a solid surface in a spatially separate location and then amplified to increase signal strength. The sequences of all templates are determined in parallel by the addition of complementary nucleotide bases to a sequencing primer, coupled with signal detection from this event.

Paired-end read: See Paired-end sequencing.

Paired-end sequencing: A technology that obtains sequence reads from both ends of a DNA fragment template. Paired-end sequencing can greatly improve de novo sequencing applications by allowing contigs to be joined when they contain read pairs from a single template fragment, even if no reads overlap. It can also improve the mapping of reads to a reference genome in regions of repetitive DNA (and the detection of sequence variants in those locations): if one read contains repetitive sequence but the other maps to a unique genome position, then both reads can be mapped.

Phred score: The Phred software was developed by Phil Green and coworkers on the Human Genome Project to improve the accuracy of base calling on ABI sequencers (using fluorescent Sanger chemistry). Phred assigns each base a quality score that encodes the probability of error for that base. The Phred score is 10 times the negative log (base 10) of the error probability; thus, a base with an accuracy of 99% (error probability 0.01) receives a Phred score of 20. Phred scores have been adopted as the measure of sequence quality by all NGS manufacturers, although the estimation of error probability is done in many different ways (in some cases with questionable validity).

Poisson distribution: A random probability distribution in which the mean is equal to the variance. This distribution describes rare events that occur with equal probability across an interval of time or space. In NGS, sequence reads obtained from sheared genomic DNA are often assumed to be Poisson-distributed across the genome.

Polymerase read (formerly called "read"): A sequence of nucleotides incorporated by the DNA polymerase while reading a template, such as a circular SMRTbell template.
Polymerase reads are most useful for quality control of the instrument run. Polymerase read metrics primarily reflect movie length and other run parameters rather than insert size distribution. Polymerase reads are trimmed to include only the high-quality region; they include sequences from adapters and can further include sequence from multiple passes around a circular template.

Pyrosequencing: A method of DNA sequencing developed in 1996 by Nyrén and colleagues that directly detects the addition of each nucleotide base as a template is copied. The method detects light emitted by a chemiluminescent reaction driven by the pyrophosphate that is released as each nucleotide triphosphate is covalently linked to the growing copy strand. Each type of base is added in a separate reaction mix, but terminators are not used; thus a series of identical bases (a homopolymer) creates multiple covalent linkages and a brighter light emission. This chemistry is used in the Roche 454 sequencing machines. For further information on Nyrén and colleagues, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727787/.

Read of insert: The highest-quality single sequence for an insert, regardless of the number of passes. For example, if your template yielded one-and-a-half subreads, that information will be combined into a read of insert. CCS is a special case in which at least two full subreads are collected for an insert. Reads of insert give the most accurate estimate of the length of the insert sequence loaded onto an SMRT cell. For long templates, reads of insert may be the same as polymerase reads.

Reference genome: A curated consensus sequence for all of the DNA in the genome (all of the chromosomes) of a species. Because the reference genome is created as the synthesis of a variety of different data sources, it may occasionally be updated; thus, a particular instance of the reference is referred to by a version number.

Reference sequence: The formally recognized, official sequence of a known genome, gene, or artificial DNA construct. A reference sequence is usually stored in a public database and may be referred to by an accession number or other shortcut designation, such as human genome hg19. An experimentally determined sequence produced by an NGS machine may be aligned and compared with a reference sequence (if one exists) to assess accuracy and to find mutations.

Repetitive DNA: DNA sequences that are found in identical duplicates many times in the genome of an organism. Some repetitive DNA elements occur in genomic features with important biological properties, such as centromeres and telomeres. Other repetitive elements, such as transposons, are similar to viruses that copy themselves into many locations in the genome. Simple sequence repeats are another type of repetitive element, composed of tandem repeats of 1-, 2-, or 3-base patterns such as CAGCAGCAGCAG… A short sequence read that contains only repetitive sequence may align to many different genomic locations, which creates problems for de novo assembly, mapping of sequence fragments to a reference genome, and many related applications.

Ribosomal DNA (rDNA): Genes that code for ribosomal RNA (rRNA) are present in multiple copies in the genomes of all eukaryotes. In most eukaryotes, the rDNA genes are present in identical tandem repeats that contain the coding sequences for the 18S, 5.8S, and 28S rRNA genes. In humans, a total of 300–400 rDNA repeats are located in regions on chromosomes 13, 14, 15, 21, and 22; these regions form the nucleolus. Additional tandem repeats of the coding sequence for the 5S rRNA are located separately. rRNA is a structural component of ribosomes and is not translated into protein.
The rRNA genes are highly transcribed, contributing >80% of the total RNA found in cells. RNA sequencing methods therefore generally include purification steps to remove rRNA or to enrich mRNA from protein-coding genes.

Ribosomal RNA (rRNA): See Ribosomal DNA.

RNA-seq: The sequencing of cellular RNA, usually used as a method to measure gene expression, but also used to detect sequence variants in transcribed genes, alternative splicing, gene fusions, and allele-specific expression. For novel genomes, RNA-seq can be used as experimental evidence to identify expressed regions (coding sequences) and map exons onto contigs and scaffolds.

Roche 454 genome sequencer: DNA sequencers developed in 2004 by 454 Life Sciences (subsequently purchased by Roche) were the first commercially available machines to use massively parallel sequencing of many templates at once. These "next-generation sequencing" (NGS) machines increased the output (and reduced the cost) of DNA sequencing by at least three orders of magnitude over sequencing methods that used Sanger chemistry, but produced shorter sequence reads. 454 machines use beads to isolate individual template molecules and an emulsion PCR system to amplify these templates in situ, and then perform the sequencing reactions in a flowcell containing millions of tiny wells, each of which fits exactly one bead. 454 uses pyrosequencing chemistry, which has very few base-substitution errors but a tendency to produce insertion/deletion errors in stretches of homopolymer DNA.

SAM/BAM: See BAM file.

Sanger sequencing method: The chain-termination method developed by Frederick Sanger in the 1970s to determine the nucleotide sequence of cloned, purified DNA fragments. The DNA is denatured into single strands, a short oligonucleotide sequencing primer is annealed to one strand, and DNA polymerase extends the primer, adding new complementary deoxynucleotides one at a time and creating a copy of the strand. A small amount of a dideoxy nucleotide is included in the reaction, which causes the polymerase to terminate, creating truncated copies. In a reaction with a single type of dideoxynucleotide, all fragments of a specific size end with the same base. Four separate reactions, each containing a single dideoxy nucleotide (ddG, ddA, ddT, or ddC), are conducted, and then all four reactions are run in four adjacent lanes of a polyacrylamide gel. The actual sequence is determined from the lengths of the fragments, which correspond to the positions where a dideoxy nucleotide was incorporated.

Sequence alignment: An algorithmic approach to finding the best matching of consecutive letters in one sequence (text symbols that represent the polymer subunits of DNA or protein) with another. Generally, sequence alignment methods balance gaps against mismatches, and the relative scoring of these two features can be adjusted by the user.

Sequence assembly: A computational process of finding overlaps of identical (or nearly identical) strings of letters among a set of sequence fragments and iteratively joining them together to form longer sequences.

Sequence fragment: A short string of text that represents a portion of a DNA (or RNA) sequence. NGS machines produce short reads, which are sequence fragments read from DNA fragments.

Sequence read, short read: When DNA sequence is obtained by any experimental method, including both Sanger and next-generation methods, the data are obtained from individual template molecules as a string of nucleotide bases (represented by the letter symbols G, A, T, C). This string of letters is called a sequence read. The length of a sequence read is determined by the technology: Sanger reads are typically 500–800 bases long, Roche 454 reads 200–400 bases, and Illumina reads 25–200 bases (depending on the model of machine, reagent kit, and other variables).
Sequence reads produced by NGS machines are often called short reads.

Sequence variants: Differences at specific positions between two aligned sequences. Variants include single-nucleotide polymorphisms (SNPs), insertions and deletions, copy number variants, and structural rearrangements. In NGS, variants are found after alignment of sequence reads to a reference genome. A variant may be observed as a single mismatched base in a single sequence read, or it may be confirmed by variant detection software from multiple sources of data.

Sequencing by synthesis: The term used by Illumina to describe the chemistry used in its NGS machines (Illumina Genome Analyzer, HiSeq, MiSeq). The biochemistry involves a single-stranded template molecule, a sequencing primer, and DNA polymerase, which adds nucleotides one by one to a DNA strand complementary to the template. Each nucleotide carries a reversible terminator, so that only one nucleotide can be added to each template per cycle, and each incorporation is accompanied by the emission of light, which is detected by a camera. After a cycle that adds just one G, A, T, or C base to each template, the terminators are removed so that another base can be added to all templates. This cycle of synthesis and terminator removal is repeated to achieve the desired read length.

Sequencing primer: A short single-stranded oligonucleotide that is complementary to the beginning of the DNA fragment to be sequenced (the template). During sequencing, the primer anneals to the template DNA, and then DNA polymerase adds nucleotides that extend the primer, forming a new strand of DNA complementary to the template molecule; DNA polymerase cannot synthesize new DNA without a primer. In traditional Sanger sequencing, the sequencing primer is complementary to the plasmid vector used for cloning; in NGS, the primer is complementary to a linker that is ligated to the ends of the template DNA fragments.

Sequencing ZMW: A ZMW (zero-mode waveguide) that is expected to be able to produce a sequence if it is populated with a polymerase. ZMWs used for automated SMRT cell alignment are not considered sequencing ZMWs.

SFF file: Standard flowgram format, a file type developed by Roche 454 for the sequencing data produced by its NGS machine. The SFF file contains both sequence and quality information about each base. The format was initially proprietary but has been standardized and made public in collaboration with the international sequence databases. SFF is a binary format and requires custom software to read it or convert it to human-readable text formats.

Shotgun sequencing: A strategy for sequencing novel or unknown DNA. Many copies of the target DNA are sheared into random fragments, and then primers are added to the ends of these fragments to create a sequencing library. The library is sequenced by high-throughput methods to generate a large number of DNA sequence reads that randomly sample the original target. The target DNA is reconstructed using a sequence assembly algorithm that finds overlaps between the sequence reads. This method may be applied to small sequences, such as cosmid and BAC clones, or to entire genomes.

Smith–Waterman alignment: A rigorous optimal alignment method for two sequences based on dynamic programming. This method always finds the optimal local alignment between two sequences, but it is slow and computationally demanding because it computes a matrix of all possible alignments with all possible gaps and mismatches. The size of this matrix grows with the product of the lengths of the sequences to be aligned, so it requires huge amounts of memory and CPU time to work with genome-sized sequences.
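The Smith–Waterman dynamic program described above can be sketched directly. This toy version (with arbitrarily chosen match/mismatch/gap scores) returns only the optimal local-alignment score and omits the traceback that recovers the alignment itself:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Optimal local-alignment score via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: the score floor of 0 lets a new alignment start anywhere
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))     # 8 -- four matches at 2 points each
print(smith_waterman("GATTACA", "TACA"))  # 8 -- "TACA" matches exactly
```

The matrix `H` makes the glossary's resource claim concrete: its size is the product of the two sequence lengths, which is why the exact method is impractical for genome-sized inputs.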
SMRT cell: A consumable substrate comprising arrays of zero-mode waveguide nanostructures.

SOLiD sequencing: The Applied Biosystems division of Life Technologies Inc. purchased the SOLiD (Supported Oligo Ligation Detection) technology from the biotech company Agencourt Personal Genomics and released the first commercial version of this NGS machine in 2007. The technology is fundamentally different from any other Sanger or NGS method in that it uses ligation of short fluorescently labeled oligonucleotides to a sequencing primer, rather than DNA polymerase, to copy a DNA template. Sequences are detected two bases at a time, and base calls are then made from two overlapping oligos. Raw data files use a "color space" system that differs from the base calls produced by all other sequencing systems and requires different informatics software. This system has some interesting built-in error-correction algorithms but has failed to show superior overall accuracy in the hands of customers. The yield of the system is similar to that of Illumina NGS machines.

Subread: Each polymerase read is partitioned into one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell template, with no adapter sequences. Subreads contain the full set of quality values and kinetic measurements and are useful for applications such as de novo assembly, resequencing, and base modification analysis.

Variant detection: NGS is frequently used to identify mutations in DNA samples from individual patients or experimental organisms. Sequencing can be done at the whole-genome scale; by RNA-seq, which targets expressed genes; by exome capture, which targets specific exon regions captured by hybridization to probes of known sequence; or with amplicons for genes or regions of interest. In all cases, sequence variants are detected by alignment of NGS reads to a reference sequence and then identification of differences between the reads and the reference. Variant detection algorithms must distinguish among random sequencing errors, differences caused by incorrect alignment, and true variants in the genome of the target organism. Various combinations of base quality scores, alignment quality scores, depth of coverage, variant allele frequency, and the presence of nearby sequence variants and indels are used to differentiate true variants from false positives. Recent algorithms have also made use of machine-learning methods based on training sets of genotype data, or on large sets of samples from different patients/organisms sequenced in parallel with the same sample preparation methods on the same NGS machines.

Zero-mode waveguide (ZMW): A nanophotonic device for confining light to a small observation volume, for example, a small hole in a conductive layer whose diameter is too small to permit the propagation of light in the wavelength range used for detection. Physically part of an SMRT cell.
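The Phred score entry above translates directly into code. This sketch (assuming the common Phred+33 ASCII offset used in FASTQ quality lines; the function names are invented for illustration) converts error probabilities to scores and decodes a quality string:

```python
import math

def phred_from_error(p):
    """Phred quality score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality line (Phred+33 by default) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

print(round(phred_from_error(0.01)))  # 20 -- a 99%-accurate base call
print(decode_qualities("II5!"))       # [40, 40, 20, 0]
```

On this scale, Q20 means a 1-in-100 chance of error and Q30 a 1-in-1,000 chance, which is why Q30 is often quoted as a quality threshold for NGS reads.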
Suggested readings
https://www.wired.com/2001/07/nothing-that-glitters-is-digigold/.
BC Cancer Agency, n.d. How to access data. http://www.bcgsc.ca/services/solseq/data_access.
Bitcoin glossary. Merit CyberSecurity Group, n.d. www.Termanini.com.
Bitcoin vulnerability and exposure. https://en.bitcoin.it/wiki/Common_Vulnerabilities_and_Exposures.
Blockchain hack (51%/double-spend attack). https://medium.com/coinmonks/what-is-a-51-attack-or-double-spend-attack-aa108db63474.
Blockchain threat report. https://www.mcafee.com/enterprise/en-us/assets/reports/rp-blockchain-security-risks.pdf.
Cannabis genomics. https://www.medicinalgenomics.com/blockchained-cannabis-dna/.
Catalog DNA press. https://catalogdna.com/press/.
Cerber ransomware. https://www.itsecurityguru.org/2016/11/02/the-cerber-ransomware-family-has-evolved-and-mutated/.
Malware coordination using blockchain (Cerber C&C). https://www.cyber-threat-intelligence.com/research/cerber-cnc/ and https://www.cyber-threat-intelligence.com/publications/CNS2018-Cerber.pdf.
DAO: Decentralized Autonomous Organization. https://en.wikipedia.org/wiki/The_DAO_(organization).
DNA-based data storage. http://shannon.engr.tamu.edu/wp-content/uploads/sites/138/2018/05/Texas-Part2.pdf.
DNA-based archival storage system. https://homes.cs.washington.edu/~bornholt/dnastorage-asplos16/.
DNA is a mystery code. http://www.messagetoeagle.com/mystery-of-our-coded-dna-who-was-the-programmer/.
DNA is not a computer. https://www.skepticink.com/smilodonsretreat/2014/09/10/dna-is-not-like-a-computer/.
DNA sequencing. https://biologydictionary.net/dna-sequencing/.
DNA sequencing caught in deluge of data. https://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?pagewanted=all&_r=0.
DNA writer. https://earthsciweb.org/js/bio/dna-writer/.
Alcamo, Edward (Ed.), 2000. DNA Technology, second ed. Academic Press.
Ethereum and the DAO. https://blockgeeks.com/guides/ethereum/#The_aftermath_Ethereum_splits.
Foundations of blockchain and DNA. https://www.intelligenthq.com/innovation-management/8-foundationsblockchain-dna-architecture-part-1/.
Gartner five-year storage scenario. https://www.storagereview.com/the_fiveyear_storage_scenario_preview_of_gartner_s_latest_storage_technology_projections_2014.
Gene editing video: the rise of DNA data storage. http://www.wired.com/story/the-rise-of-dna-data-storage/.
Illumina data access. http://www.bcgsc.ca/services/solseq/data_access.
Illumina Inc. sequence file formats. https://www.illumina.com/informatics/sequencing-data-analysis/sequence-file-formats.html.
https://www.intelligenthq.com/innovation-management/.
Is God a mathematician? http://www.messagetoeagle.com/divine-knowledge-is-god-a-mathematician/.
Kurzweil on DNA storage. http://www.kurzweilai.net/how-to-store-the-worlds-data-on-dna.
Jensen, C.J., 2013. Intelligence Studies. CRC Press.
Catalog DNA company. http://www.ijetjournal.org/volume4/issue5/IJET-V4I5P19.pdf.
Kaku, M., 2008. Physics of the Impossible. Anchor Books.
Kirby, Lorn T., 1992. DNA Fingerprinting. Oxford University Press.
Malware can be encoded in DNA. http://www.technologyreview.com/s/608596/scientists-hack-a-computer-using-dna/.
Mollecula Maxima: a high-level bio-programming language. https://synbiobeta.com/a-high-level-bio-programming-language-for-designing-dna-sequences/.
Namasudra, S., Chandra Deka, G., 2018. Advances of DNA Computing in Cryptography. CRC Press.
Webb, Jeremy, 2018. "37 Trillion Pieces of You: The Plan to Map the Entire Human Body." New Scientist, November 21, 2018. www.newscientist.com.
Smart contracts. https://googl/g738LG.
Security and privacy in DNA sequencing. http://dnasec.cs.washington.edu/dnasec.pdf.
Smart contracts and the real world. https://www.slideshare.net/christopherbrewster/smart-contacts-and-the-real-world?from_action=save.
Sikander, M., Khiyal, H., 2017. Study of Processing Storage & Cost of DNA Computing with Silicon Computation. Lambert Academic Publishing.
The expanding digital universe. https://www.tobb.org.tr/BilgiHizmetleri/Documents/Raporlar/Expanding_Digital_Universe_IDC_WhitePaper_022507.pdf.
Termanini, D.R., 2018. The Nano Age of Digital Immunity Infrastructure. CRC Press.
Tomasz, G., et al., 2008. Computational Intelligence in Biomedicine and Bioinformatics. Springer.
Ricardian smart contracts. http://www.webfunds.org/guide/ricardian.html.
CHAPTER 9
Decoding back to binary
If you think technology can solve your security problems, then you don’t understand the problems and you don’t understand the technology.
– Bruce Schneier
We’re addressing biotechnology here because that is the immediate threshold and challenge that we now face. As the threshold for self-organizing nanotechnology approaches, we will then need to invest specifically in the development of defensive technologies in that area, including the creation of a technological immune system. Consider how our biological immune system works. When the body detects a pathogen, the T-cells and other immune-system cells self-replicate rapidly to combat the invader. A nanotechnology immune system would work similarly, both in the human body and in the environment, and would include nanobot sentinels that could detect rogue self-replicating nanobots. When a threat was detected, defensive nanobots capable of destroying the intruders would rapidly be created (eventually with self-replication) to provide an effective defensive force.
– Ray Kurzweil, The Singularity Is Near
Introduction
This is the third part of our discussion of DNA coding and decoding. Each section digs into the subject in more detail, with discussion and illustrations. We start with the phenomenon of nature’s dynamic equilibrium.
Dynamic equilibrium
Does the total amount of water on the earth always remain the same? Michael McClennan, Research Informaticist, Department of Geoscience, UW-Madison, explains: the answer is that, roughly, yes, there is about the same amount of water on the earth now as there was in the Mesozoic era. All water that is breathed, drunk, and urinated by living things remains part of the planet's total water content. The total amount is not exactly constant, as there are two fluxes of water between the earth and the rest of the solar system. There is a steady rain of water-bearing meteoroids hitting the planet (see https://www.usgs.gov/special-topic/water-science-school/science/evaporation-and-water-cycle), which slowly increases the amount of water. At the same time, molecules of water often dissociate in the upper atmosphere into hydrogen and oxygen due to ultraviolet light from the sun. Some of the hydrogen atoms have enough energy to escape from the earth's gravitational field, and so are lost. This slowly decreases the amount of water. Fig. 9.1 provides a good illustration of this amazing dynamic of water.
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00009-4 Copyright © 2020 Elsevier Inc. All rights reserved.
FIGURE 9.1 It is amazing: energy is indestructible; it only changes from one state to another (the first law of thermodynamics: energy can be neither created nor destroyed, only transformed from one form to another). Likewise, Earth's water budget, the total amount of water on the planet, does not change over time. The hydrologic cycle is a closed system: water is constantly moving and changing form, but it is neither created nor destroyed. The picture is annotated to show the trajectory of the water cycle. Picture extracted from MERIT CyberSecurity Group library.
In addition, tectonic plate subduction is constantly carrying water down into the earth's mantle, and volcanoes are constantly spewing water out onto the surface again. The balance between these two processes can change considerably over time. But all these fluxes are small compared with the total amount of water on the earth, and two of them are in the opposite direction of the other two. So, the overall change is insignificant, even when considered over long spans of geologic time. In telecommunication, we see a similar cycle: the phone converts our analog voice into a digital signal, the signal travels across the telecom wire or the web, and it is then converted back into audible voice at the receiving phone. Fig. 9.2 is a simple explanation of the process.
FIGURE 9.2 We encounter this kind of equilibrium throughout our existence and in the matter around us. In telecom, the phone converts (encodes) our analog voice into digital data, carries it to the destination over a carrier, and decodes it back into voice. In bioinformatics and DNA computing, we have the same phenomenon. Copy from MERIT Engineering.
DNA storage has the same cycle. We start with a binary file; it is converted into DNA code and stored as an ACGT sequence. Then we run the reverse process and decode it back to binary. Fig. 9.3 illustrates the process.
FIGURE 9.3 An exhibit of the DNA digital storage cycle: digital information is encoded into DNA and encapsulated within silica spheres. Upon release of the DNA from the spheres by a fluoride solution, the DNA is read by Illumina sequencing and decoded to recover the original information. Picture extracted from MERIT CyberSecurity Group library.
The DNA storage chain has three steps. The first step is the encoding of binary information. The second step is to synthesize DNA with A, C, G, T to match the binary sequence. The third step is to decode the data back into its original digital form. Fig. 9.4 illustrates the DNA storage and retrieval process.
FIGURE 9.4 Here is another neat diagram that shows the three cycles of the DNA storage chain. The first step is the process of encoding binary data into synthetic, man-made strands of DNA. To store a binary digital file in DNA, the bits (binary digits) are converted from 1s and 0s into the letters A, C, G, and T. These letters represent the four unique nucleotides that make up DNA: adenine, cytosine, guanine, and thymine. In the second step, the physical storage medium is a synthesized chain of DNA containing the A’s, C’s, G’s, and T’s in a sequence corresponding to the order of the bits in the digital file. In the third step, to recover the data, the chain of DNA is sequenced, and the order of A’s, C’s, G’s, and T’s is decoded back into the original digital sequence. Excerpt from “Future of DNA Data Storage” September 2018.
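The three steps above can be sketched in a few lines of Python. The 2-bit mapping used here (00→A, 01→C, 10→G, 11→T) is an illustrative choice; production systems add addressing, primers, and error correction on top of the raw conversion:

```python
# Minimal sketch of the DNA storage chain: encode bytes to bases,
# "store" the strand as a string, then decode it back to bytes.
ENC = "ACGT"  # index by 2-bit value: 00->A, 01->C, 10->G, 11->T

def encode(data: bytes) -> str:
    # each byte becomes four nucleotides (two bits per base)
    return "".join(ENC[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))

def decode(strand: str) -> bytes:
    # reassemble four 2-bit values into each byte
    vals = [ENC.index(c) for c in strand]
    return bytes((vals[i] << 6) | (vals[i + 1] << 4) | (vals[i + 2] << 2) | vals[i + 3]
                 for i in range(0, len(vals), 4))

strand = encode(b"hello")
print(strand)                      # -> CGGACGCCCGTACGTACGTT
assert decode(strand) == b"hello"  # the round trip is lossless
```

Running `encode` on the five bytes of "hello" yields a 20-nucleotide strand, and `decode` recovers the original bytes exactly.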
Structure of DNA
DNA is a molecule composed of two chains that coil around each other to form a double helix carrying the genetic instructions for the development, functioning, growth, and reproduction of all known organisms and many viruses. Shape: the DNA molecule is shaped like a ladder twisted into a coiled configuration called a double helix, as shown in Fig. 9.5.
FIGURE 9.5 Watson and Crick concluded their 1953 paper on the structure of DNA with the following statement: “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.” Picture extracted from MERIT CyberSecurity Group library.
The "specific pairing" they mention is the hydrogen bonding between adenine and thymine, and between guanine and cytosine. The base pairing rules are as follows: A always pairs with T, and G always pairs with C. The "copying mechanism" is DNA replication, a simple yet accurate series of steps by which DNA carries the instructions for its own reproduction. The genetic code is degenerate, i.e., more than one codon can code for a single amino acid; of the 64 codons, 61 code for the 20 amino acids. The genetic code also has two kinds of punctuation marks, the START and STOP codons, which signal the beginning and end of protein synthesis in all organisms.
The central dogma of genetics
Almost immediately after the structure of DNA was elucidated by Watson and Crick, the mechanism by which genetic information is maintained within a cell and used to create proteins became apparent. This mechanism has become known as the "central dogma of molecular biology." The central dogma has three main parts, as shown in Fig. 9.6:
FIGURE 9.6 DNA is stuck in the nucleus because it is a large double helix; this is a problem because proteins are made at the ribosome, which is located outside of the nuclear membrane, in the cytoplasm. Messenger RNA (mRNA) makes a copy of the code via transcription (think of a waiter taking down your order at a restaurant). Two main differences between DNA and RNA are that RNA is single stranded and does not contain thymine (T) but instead has uracil (U). Since mRNA is only single stranded, it is small enough to leave the nucleus, and it does just that. It carries the code of a gene (which codes for one protein) out of the nucleus and to the ribosome (the site of protein synthesis). Every three bases on a strand of mRNA form a codon, and each codon codes for a specific amino acid.
1. Genetic information is preserved and transmitted to new cells and offspring by a duplication process called replication. Replication occurs as a part of mitosis, the normal cell division reviewed earlier.
2. Genetic information stored in the nucleus is made available to the rest of the cell through the creation of numerous temporary copies known as messenger RNA (mRNA), in a process known as transcription. mRNA is like DNA in that it consists of a long, specific sequence of nucleotides. It differs in that it is single stranded, contains the sugar ribose rather than deoxyribose in its backbone, and uses the base uracil in place of thymine. Transcription is a major part of one of the most important aspects of gene expression, the "turning on" of genes in appropriate cells at appropriate times.
3. In the cytoplasm, ribosomes construct specific proteins by interpreting the sequence of bases in mRNA. This process is known as translation. The genetic code that allows ribosomes to assemble the correct amino acids in the correct order is the subject of the following section.
Since proteins are the structural core of the cell, and since proteins (in the form of enzymes) control nearly all of the cell's metabolism, the ability to specify protein structure makes DNA the primary determinant of the structure and function of cells. The central dogma is a major organizing principle in molecular biology, and the organization of DNA in cells and genes cannot be fully understood except in its context.
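The transcription and translation steps above can be illustrated with a toy Python sketch. The codon table here is a four-entry excerpt, not the real 64-codon table, and the example gene is invented for demonstration:

```python
# Toy illustration of the central dogma: transcription, then translation.
# CODON_TABLE is a tiny excerpt of the real 64-codon genetic code.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def transcribe(coding_strand: str) -> str:
    # mRNA matches the coding strand, with uracil (U) in place of thymine (T)
    return coding_strand.replace("T", "U")

def translate(mrna: str) -> list:
    protein = []
    for i in range(0, len(mrna) - 2, 3):        # read codon by codon
        amino_acid = CODON_TABLE.get(mrna[i:i + 3])
        if amino_acid == "STOP":                # a stop codon ends synthesis
            break
        protein.append(amino_acid)
    return protein

print(translate(transcribe("ATGTTTGGCTAA")))    # -> ['Met', 'Phe', 'Gly']
```

The DNA gene ATG-TTT-GGC-TAA is transcribed to AUG-UUU-GGC-UAA, and the ribosome-like loop reads it three bases at a time until the stop codon.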
Key players in DNA synthesis/sequencing and storage
Research into DNA data storage is largely performed in academic labs, with the work funded by the National Science Foundation (NSF), the National Institutes of Health (NIH), the Intelligence Advanced Research Projects Activity (IARPA), and the Defense Advanced Research Projects Agency (DARPA). Although some news articles claim that major industry players, such as Google, Facebook, and Apple, are interested in pursuing DNA as a viable storage medium, only Microsoft is leading a known R&D effort in this area. Start-ups marketing DNA storage services have also begun to surface this year. In the following, the top players from each sector are highlighted: https://www.nih.gov/grants-funding.
Academic research
University of Washington
The Molecular Information Systems Lab (MISL) at the University of Washington (UW), https://misl.cs.washington.edu/, led by Dr. Luis Ceze, is a partnership between UW's Departments of Computer Science and Electrical Engineering and Microsoft Research. The program brings together faculty, students, and research scientists with expertise in computer architecture, programming languages, synthetic biology, and biochemistry. The UW/Microsoft team holds the current record for the amount of data stored in DNA (~400 MB).
Harvard University
Efforts at Harvard University's Wyss Institute are led by geneticist George Church, https://wyss.harvard.edu/team/core-faculty/george-church/. The Church lab made a huge breakthrough in the field of DNA data storage in 2012 by developing a novel encoding scheme that allowed for the synthesis of thousands of oligonucleotides on a single DNA microchip, storing 0.66 MB of data (a book, JPG images, and a JavaScript program). Prior to this work, the largest DNA data storage project (by the J. Craig Venter Institute) had encoded 0.0009 MB of data. George Church and his collaborators have also used CRISPR to encode data sequentially into the genomes of living bacteria.
Columbia University
Research at Columbia University on DNA storage was led by Yaniv Erlich, an associate professor of computer science, https://datascience.columbia.edu/yaniv-erlich. Erlich pioneered a technology, known as DNA Fountain, that pushed the storage density of DNA to within 80% of its theoretical limit. In 2017, Erlich took a leave of absence from Columbia University to join MyHeritage as Chief Science Officer.
University of Illinois, Urbana–Champaign
Olgica Milenkovic, professor of Electrical and Computer Engineering, https://ece.illinois.edu/directory/profile/milenkov, leads a $1.5 million effort to produce new DNA-based nanoscale storage devices using chimeric DNA, a hybrid molecule made from two different sources. SemiSynBio, a 3-year research project funded by a partnership between the NSF and the Semiconductor Research Corporation (SRC), aims to design a method to read, write, and store data in a more cost-effective way than current DNA storage techniques. Milenkovic also recently received a 3-year, $2.5 million grant from DARPA to combine synthetic DNA with computing.
Swiss Federal Institute of Technology (ETH) Zurich
ETH Zurich's DNA data storage research team, led by Robert Grass, https://www.researchgate.net/profile/Robert_Grass, has pioneered a technique for storing DNA in silica, creating a "synthetic fossil" that can preserve DNA and the data it stores for thousands of years, even in extreme conditions.
Research consortium
The Semiconductor Research Corporation (SRC), https://www.src.org, is a North Carolina-based nonprofit research consortium focused on microelectronics and semiconductor research. Concerned with the upcoming age of zettabyte data generation, it is leading research efforts focused on the advancement and use of alternative storage technologies such as DNA storage, 5D optical storage, magnetic storage, and cryogenics. SRC conducted a workshop, cohosted with the IARPA, on DNA-based Massive Information Storage in April 2016.
Industry
Microsoft Research
The most active and visible industry leader in DNA storage technology is Microsoft Research. As a research partner in the University of Washington's Molecular Information Systems Lab (MISL), Microsoft is pushing the boundaries of DNA data storage capabilities and has demonstrated significant leaps in overall storage capacity over the past several years, https://www.researchgate.net/lab/Luis-Ceze-Lab.
Micron Technology
Top memory manufacturer Micron Technology is also funding DNA digital storage research, in collaboration with researchers at Boise State University as well as through research consortia, to explore the viability of DNA for future storage needs, https://potomacinstitute.org/images/studies/Future_of_DNA_Data_Storage.pdf.
More research
News reports have suggested that Apple, Facebook, Google, Intel, and IBM are also exploring DNA as a data storage medium. However, concrete research efforts or technological advancements by these companies have not been reported to date.
Start-ups
Catalog
Catalog is a Boston-based start-up, https://catalogdna.com, that raised $9 million in funding in June 2018 in the hopes of providing commercial DNA data storage services. Catalog intends to bypass the DNA synthesis process; its methodology uses a large collection of premade molecules to encode data in DNA. Using this technology, Catalog has stored around 1 KB of data in DNA, including literary works by Douglas Adams and Robert Frost. Catalog's Chief Science Officer, Devin Leake, was most recently Head of DNA Synthesis at Ginkgo Bioworks. Prior to Ginkgo, Leake held positions at Gen9, Thermo Fisher, and Dharmacon.
Iridia
Formerly known as Dodo Omni Data, Iridia, http://iridia.com/press-room/, is a start-up based in San Diego working on new methods for storing data in DNA. Its technology aims to combine DNA polymer synthesis, electronic nanoswitches, and semiconductor fabrication technologies into a highly parallel format that enables an array of nanomodules to store data at exceptionally high density. One investor in the start-up is Jay Flatley, the founder and executive chairman of Illumina, the world leader in DNA sequencing technology.
Helixworks
Founded in 2015, this Irish start-up, https://helix.works/, took in an undisclosed amount of seed funding last year to turn its proprietary DNA data storage technology into a commercial product. Helixworks is set to offer the world's first commercially available DNA storage drive, which will shortly be available to purchase on Amazon for $199. Helixworks' DNA drive offers "512 KB of data storage in specially encoded DNA, encapsulated specially in a custom gold pill, with a guaranteed lifetime (under normal conditions) of 10 years, and a potential shelf-life of thousands."
US Government
Defense Advanced Research Projects Agency
The Defense Advanced Research Projects Agency's (DARPA's) Molecular Informatics program, https://www.darpa.mil, announced in March 2017, aims to create a new paradigm for data storage, retrieval, and processing using encoded molecules. The program awarded ~$15 million in funding to researchers at Harvard University, Brown University, the University of Illinois, and the University of Washington.
Intelligence Advanced Research Projects Activity
The IARPA's Molecular Information Storage Technology (MIST) program, https://www.iarpa.gov/index.php/research-programs/mist?id=1077, aims to develop technology that can write 1 TB of data
and read 10 TB of data per day using DNA. Furthermore, it aims to scale this technology to the exabyte scale with a viable path to commercialization within 10 years. The program started in October 2018.
National Institutes of Health
The National Institutes of Health (NIH), https://www.nih.gov, funds individual researchers working on DNA data storage. The recent demonstration by Harvard's George Church of encoding a small video file in the bacterial genome was funded by the NIH.
National Science Foundation
The National Science Foundation (NSF), https://www.nsf.gov, solicited proposals for DNA information storage research in 2017. Available grants from the NSF totaled up to $4 million.
Foreign research
European Bioinformatics Institute
The European Union has provided funding to the European Bioinformatics Institute (EBI), www.ebi.ac.uk, to conduct research on DNA as a data storage medium. EBI's efforts, led by Nick Goldman, date back to 2013 and have pioneered error correction methods.
Decoding DNA sequence back into binary
If you have the sequence TGACTCAGTCGTTCAATCTATGCC, how can code (in MATLAB or any other language) convert it into binary? Using the conversion code A = 00, T = 01, G = 10, C = 11, the output is 01100011 01110010 01111001 01110000 01110100 01101111, which happens to be the ASCII encoding of the word "crypto". Since every A pairs with a T, and every G pairs with a C, you might be tempted to define each complementary pair as a single bit: if A and T = 1 and G and C = 0, then the same sequence TGACTCAGTCGTTCAATCTATGCC becomes 101010101001101110111000. But coding each base with a single 1 or 0 convolutes the bits: A can no longer be distinguished from T, nor G from C, so half the information capacity is lost. Researchers estimate that a gram of DNA can store on the order of a zettabyte of information. A nucleotide sequence is made up of four bases (A, C, G, T), which can be encoded using a base-4 system (00, 01, 10, 11). Pairs of complementary nucleotides (A–T and C–G), along with a few other components such as sugars and phosphates, link up to form DNA, the double-helix structure we see so commonly. Fig. 9.7 shows Python code for the DNA encoding and decoding process.
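The worked example above can be checked with a few lines of Python, using the same mapping (A = 00, T = 01, G = 10, C = 11):

```python
# 2-bit conversion code used in the text: A=00, T=01, G=10, C=11
BASE_TO_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def dna_to_bits(seq: str) -> str:
    return "".join(BASE_TO_BITS[b] for b in seq)

def bits_to_dna(bits: str) -> str:
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def bits_to_text(bits: str) -> str:
    # interpret each 8-bit group as an ASCII character
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

seq = "TGACTCAGTCGTTCAATCTATGCC"
bits = dna_to_bits(seq)
print(bits_to_text(bits))        # -> crypto
assert bits_to_dna(bits) == seq  # decoding is the exact inverse of encoding
```

The 24-base sequence yields 48 bits, i.e., six ASCII bytes, and converting the DNA back from those bits reproduces the original sequence exactly.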
FIGURE 9.7 A simple simulation to show that DNA sequences can encode and store digital information and can be successfully decoded.
To overcome the challenge of decoding and encoding data, we organize data in DNA in a similar fashion to Goldman’s method, as shown in Fig. 9.8. Segmenting the nucleotide representation into blocks, which we synthesize as separate strands, allows storage of large values. Tagging those strands with identifying primers (keys) allows for random access of selected molecules of interest, instead of serial readout.
FIGURE 9.8 Segmentation mechanism: the output strand is equipped with keys and an address for dynamic random-access retrieval. Picture extracted from MERIT CyberSecurity Group library.
Encoding data into a single, long strand of DNA is risky when it comes time to recover the data. A safer process is to encode the data in shorter strands, constructing the first part of each strand from the same data found at the end of the previous strand. This way we have multiple copies of the overlapping data for comparison. Fig. 9.9 illustrates how the input nucleotides are chopped into small segments and linked together.
FIGURE 9.9 Random-access mechanism: the input nucleotides are segmented into short strands that are linked together by overlaps. Picture extracted from MERIT CyberSecurity Group library.
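The overlapping segmentation just described can be sketched as follows. Goldman's actual scheme used 100-nt strands with fourfold overlap; the 20-nt strand length and 5-nt overlap here are illustrative assumptions:

```python
# Sketch of Goldman-style overlapping segmentation: each strand begins
# with the last `overlap` nucleotides of its predecessor, so every
# overlapped region exists in two strands for comparison on readout.
def segment(seq: str, length: int = 20, overlap: int = 5) -> list:
    step = length - overlap
    return [seq[i:i + length] for i in range(0, len(seq) - overlap, step)]

strands = segment("ACGT" * 20)   # an 80-nt input sequence
print(len(strands))               # -> 5 strands of 20 nt each
# verify the redundancy: each strand's head repeats the previous strand's tail
assert all(a[-5:] == b[:5] for a, b in zip(strands, strands[1:]))
```

On decoding, disagreements inside an overlapped region flag a synthesis or sequencing error, which is exactly why the shorter, redundant strands are safer than one long strand.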
We wanted to enrich your vocabulary with some key terms, related to encoding DNA records, which you may find worthwhile:
Payload: The string of nucleotides representing the data to be stored is broken into data blocks, whose length depends on the desired strand length and the additional overheads of the format. To aid decoding, two sense nucleotides marked "S" indicate whether the strand has been reverse-complemented.
Address: Each data block is augmented with addressing information to identify its location in the input data string. The address space is in two parts. The high part of the address identifies the key that a block is associated with. The low part of the address indexes the block within the value associated with that key. The combined address is padded to a fixed length and converted to nucleotides as described above. A parity nucleotide is added for basic error detection.
Primers: To each end of the strand, we attach primer sequences (which act like start/stop markers). These sequences serve as a "foothold" for the PCR (DNA segment replication) process and allow the PCR to selectively amplify only those strands with a chosen primer sequence.
Random access: We exploit primer sequences to provide random access: by assigning different primers to different strands, we can perform sequencing on only a selected group of strands. Existing work on DNA storage uses a single primer sequence for all strands. While this design suffices for data recovery, it is inefficient: the entire pool (i.e., the strands for every key) must be sequenced to recover one value.
Polymerase chain reaction (PCR): There was once a great episode of the TV series "Star Trek" where alien creatures called "tribbles" got on the ship and kept breeding until the whole ship was filled with tribbles. This is not too dissimilar to how PCR works (for us nerds). You start off with a few pieces of DNA, and in a few hours, you have thousands of copies of the same piece of DNA.
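A hypothetical strand record along these lines can be sketched as follows. The 8-bit field widths, the toy parity rule, and the primer sequences are illustrative assumptions, not any real system's format; the base mapping (A = 00, T = 01, G = 10, C = 11) matches the convention used earlier in this chapter:

```python
# Toy record layout: primer | address (key + block index) | payload |
# parity nucleotide | primer. All field sizes are illustrative.
TO_DNA = {"00": "A", "01": "T", "10": "G", "11": "C"}

def bits_to_nt(bits: str) -> str:
    return "".join(TO_DNA[bits[i:i + 2]] for i in range(0, len(bits), 2))

def make_strand(fwd_primer: str, rev_primer: str, key: int, block: int,
                payload_bits: str) -> str:
    address = f"{key:08b}{block:08b}"      # high part: key; low part: block index
    bits = address + payload_bits
    parity = "ACGT"[bits.count("1") % 4]   # one parity nucleotide for error detection
    return fwd_primer + bits_to_nt(bits) + parity + rev_primer

strand = make_strand("TTGACG", "CGTCAA", key=3, block=0, payload_bits="01100011")
print(strand)  # -> TTGACGAAACAAAATGACGCGTCAA
```

Every strand for key 3 would carry the same primers, so a PCR step keyed to `TTGACG` amplifies exactly that object's strands and nothing else in the pool.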
But why do scientists need all that DNA? Are they a bunch of eccentrics who drape themselves in strands of DNA or build imaginary friends out of DNA snowmen? Maybe, maybe not.
Copying DNA sequences with polymerase chain reaction
Most scientists need a lot of DNA: to conduct genetic studies that determine whether a person will develop cancer, to artificially make proteins when formulating vaccines or drugs, or to study what types of organisms live in the sea (many of which will not grow in a lab). PCR copying of DNA has been a magnificent tool for studying almost everything we now know in biology. Most people say, "the best thing since sliced bread," but scientists say, "the best thing since PCR." In 1983, Kary Mullis devised an incredibly powerful copying technique called PCR, which uses a DNA polymerase, a special protein that moves along the DNA sequence to produce a copy, letter by letter. PCR represented a landmark contribution to biology for its ability to amplify DNA and create a DNA profile from even the smallest genetic samples. Because of its relatively low cost, it was an immediate tour de force in genetics applications, most notably forensics. For the invention of PCR, Kary Mullis was awarded a share of the Nobel Prize in Chemistry in 1993. To provide a random-access method to retrieve a DNA sequence, we design a map from keys to unique primer sequences. All strands for an object share a common primer, and different strands with
the same primer are distinguished by their different addresses. Primers allow random access via a PCR, which produces many copies of a piece of DNA in a solution. By controlling the sequences used as primers for PCR, we can dictate which strands in the solution are amplified. To read a key’s value from the solution, we simply perform a PCR process using that key’s primer, which amplifies the selected strands. The sequencing process then reads only those strands, rather than the entire pool. The amplification means sequencing can be faster and cheaper because the probability of recovering the desired object is higher. PCR targets the gene to be copied with primers, single-stranded DNA sequences that are complementary to sequences next to the gene to be copied (see Fig. 9.10). PCR works a little like chain emails. If you get a chain email and send it to two friends, who each send it to two of their friends, and so on, soon everyone has seen the same email. In PCR, first a DNA molecule is copied, then the copies are copied, and so on, until you have 30 billion copies in just a few hours. Fig. 9.11 shows how sequenced records are linked together. PCR follows the same mechanism.
FIGURE 9.10 To begin PCR, the DNA sample that contains the gene to be copied is combined with thousands of copies of primers that frame the gene on both sides. DNA polymerase uses the primers to begin DNA replication and copy the gene. The basic steps of PCR are repeated over and over until you have billions of copies of the DNA sequence between the two primers. PCR, polymerase chain reaction. Extracted from MERIT CyberSecurity Library.
FIGURE 9.11 DNA Digital Database is like the IT Database Engine. The key will identify the record; the location will identify where it is stored; the length will identify its size; the previous/next will identify its sequence; the start/stop will identify its delimiters. MERIT CyberSecurity Library.
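The chain-email analogy is just exponential doubling, which makes the "30 billion copies in a few hours" figure easy to check (assuming ideal PCR, where every cycle doubles the target):

```python
# Ideal PCR doubles the number of target molecules each thermal cycle,
# so copies grow as 2**cycles from a single starting molecule.
def copies_after(cycles: int, initial: int = 1) -> int:
    return initial * 2 ** cycles

cycles = 0
while copies_after(cycles) < 30e9:   # the ~30 billion copies cited above
    cycles += 1
print(cycles)                         # -> 35
```

About 35 cycles suffice, and since a thermal cycle takes on the order of minutes, 35 cycles is indeed "just a few hours."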
Molecular information storage (MIST) story
The IARPA is a government organization within the Office of the Director of National Intelligence responsible for leading research to overcome difficult challenges relevant to the US Intelligence Community. For more information on the IARPA, visit https://www.iarpa.gov/index.php/about-iarpa. The IARPA is led by a distinguished group of accomplished scientists and researchers. The IARPA takes risks rather than going for quick wins or sure bets. In high-risk research, failures are inevitable. Failure is acceptable so long as it is not due to a lack of technical or programmatic integrity, and the results are fully documented. MIST is one of the projects sponsored by the IARPA. The goal of the MIST program is to develop deployable storage technologies that can eventually scale into the exabyte regime and beyond, with reduced physical footprint, power, and cost requirements relative to conventional storage technologies. MIST seeks to accomplish this by using sequence-controlled polymers as a data storage medium and by building the necessary devices and information systems to interface with this medium. Technologies are sought to optimize the writing and reading of information to/from polymer media at scale and to support random access of information from polymer media archives at scale. The MIST program is anticipated to have a duration of 4 years, composed of two phases, each of which will be 24 months in duration. The desired capabilities for both phases of the program are described by three technical areas (TAs), as follows:
TA1 (storage): Develop a table-top device capable of writing information to molecular media with a target throughput and resource utilization budget. Multiple, diverse approaches are anticipated, which may utilize DNA, polypeptides, synthetic polymers, or other sequence-controlled polymer media.
TA2 (retrieval): Develop a table-top device capable of randomly accessing information from molecular media with a target throughput and resource utilization budget. Multiple, diverse approaches are anticipated, which may utilize optical sequencing methods, nanopores, mass spectrometry, or other methods for sequencing polymers in a high-throughput manner.
TA3 (operating system): Develop an operating system for use with storage and retrieval devices that coordinates addressing, data compression, encoding, error correction, and decoding of files from molecular media in a manner that supports efficient random access at scale. Multiple, diverse approaches are anticipated, which may draw on established methods from the storage industry or develop new methods to accommodate constraints imposed by polymer media.
The result of the program will be technologies that jointly support end-to-end storage and retrieval at the terabyte scale and that present a clear and commercially viable path to future deployment at the exabyte scale. The IARPA anticipates that academic institutions and companies from around the world will participate in this program. The potential for molecular information storage is huge. To put it into a consumer context, molecular technologies could store more than 200 terabits (a terabit is 10^12 bits) of data per square inch; that is 25,000 GB of information stored in something approximately the size of a 50 pence coin, compared with Apple's latest iPhone 7, with a maximum storage of 256 GB.
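The conversion in that consumer comparison is straightforward (8 bits per byte, 10^9 bytes per GB):

```python
# Sanity check of the density claim above: 200 terabits per square inch.
bits_per_sq_inch = 200e12                 # 200 terabits
gigabytes = bits_per_sq_inch / 8 / 1e9    # bits -> bytes -> GB
print(gigabytes)                           # -> 25000.0
```

200 terabits per square inch is exactly 25,000 GB (25 TB) per square inch, as stated.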
According to a July 2016 Gartner report, Google at the time ran roughly 2.5 million servers across about 15 data centers around the world, and that number was likely to rise. Google processes an average of 40,000 searches per second (statistics provided by Google), resulting in 3.5 billion searches per day and 1.2 trillion searches per year. Google reports that the energy consumed at such data centers could account for as much as 2% of the world's total greenhouse gas emissions. This means that any improvement in data storage and energy efficiency could have huge benefits for the environment, as well as vastly increasing the amount of information that can be stored. So far, Google has not been too keen to do much to benefit the environment.
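The search-volume figures are mutually consistent, which is easy to verify from the per-second rate:

```python
# Checking Google's quoted search volumes from the per-second rate.
searches_per_second = 40_000
per_day = searches_per_second * 60 * 60 * 24   # ~3.5 billion per day
per_year = per_day * 365                        # ~1.26 trillion per year
print(per_day, per_year)
```

40,000 searches per second yields about 3.46 billion per day and about 1.26 trillion per year, matching the rounded figures of 3.5 billion and 1.2 trillion.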
Just so you know, your DNA can be a wallet for bitcoin data
DNA is the carrier of the unique, unchangeable genetic information of all living things, including humans. According to a CNET report published on August 28, 2018, titled "Bitcoin Fanatics Are Storing Their Cryptocurrency Passwords in DNA," the company Carverr, a start-up that provides a service to protect the digital money of its customers, has created a method that makes it possible for those who have a considerable amount invested in bitcoin and other crypto assets to securely safeguard their private keys, even for generations to come, using its offline, hack-proof synthetic DNA technology. There is tremendous interest, both mainstream and fringe, in the possibilities of the blockchain technology behind bitcoin and other cryptocurrencies. Blockchain is a distributed consensus system in which the entire network authenticates transactions. The transactions do not have to be financial; they could as easily form part of new systems for blockchain-based email, secure online voting, authoritative timestamping for online content, or contract development. At this point, you may choose to encrypt what must be the most personal information you have, but that will not impede the process. Genecoin is a service that stores DNA sequence code in the bitcoin network: it populates the Bitcoin blockchain with the sequenced DNA of its customers, and the service is offered all over the world ("We help you get your DNA sequenced and then upload it into the bitcoin network"). Genecoin is thus the mirror image of Carverr, moving DNA code into a cryptocurrency network rather than the reverse. For initial propagation, the company is committed to using the most popular blockchain for its customers, and for now that is bitcoin. Cryptocurrency holders can use Carverr's synthetic DNA service to convert their cryptocurrency keys into DNA that gets stored in a tube.
Per Carverr, if the liquid remains well refrigerated, it could last for hundreds of years. Once the user is ready to retrieve the information, he takes the micro test tube to a lab sequencing machine, which in turn translates the private keys back into their encrypted form.
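The underlying idea can be sketched with a simple two-bits-per-base mapping. This is a minimal illustration, not Carverr's actual proprietary scheme; the mapping table, the function names, and the sample ciphertext bytes are all hypothetical, and in practice the key would be encrypted before encoding:

```python
# Map each 2-bit pair to a nucleotide (one common convention; the real
# commercial scheme is proprietary and may differ).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def key_to_dna(key: bytes) -> str:
    """Encode arbitrary bytes (e.g., an encrypted private key) as DNA letters."""
    bits = "".join(f"{byte:08b}" for byte in key)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_key(dna: str) -> bytes:
    """Decode the DNA sequence back to the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

encrypted_key = bytes.fromhex("1f8b3ac94d")   # hypothetical ciphertext bytes
dna = key_to_dna(encrypted_key)
assert dna_to_key(dna) == encrypted_key        # lossless round trip
```

Each byte becomes exactly four bases, and the round trip is lossless, which is what makes DNA usable as a long-term key store.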
BioEdit software, sequence editing
BioEdit is a biological sequence alignment editor with an intuitive, multiple-document interface and convenient features that make alignment and manipulation of DNA sequences relatively easy on a desktop computer. Several sequence manipulation and analysis options, along with links to external analysis programs, provide a working environment that allows you to view and manipulate sequences with simple point-and-click operations. Figs. 9.12 through 9.14 show some of BioEdit's capabilities. More information is available at http://www.mbio.ncsu.edu/BioEdit/page2.html.
FIGURE 9.12 BioEdit is a useful tool for manipulating and analyzing biological sequence data. The official BioEdit website lists the functions available in BioEdit. The software was tested on a trial basis, courtesy of BioEdit company management; it is very user friendly and offers a full spectrum of tools, as indicated in the picture. Courtesy of https://en.biosoft.bet.
The next-generation sequencing technology
Next-generation sequencing (NGS) refers to non-Sanger-based high-throughput DNA sequencing technologies. Millions or billions of DNA strands can be sequenced in parallel, yielding substantially more throughput and minimizing the need for the fragment-cloning methods that are often used in Sanger sequencing of genomes.
Chapter 9 Decoding back to binary
FIGURE 9.13 You can import sequence data from multiple file formats such as MSF, ASN.1, and EMBL. The results of the alignment can be exported to more than 11 formats to be used for other projects or applications. Courtesy of https://en.bio.soft.bet.
Impressive progress has been made in the field of NGS. Through advancements in the fields of molecular biology and technical engineering, parallelization of the sequencing reaction has profoundly increased the total number of produced sequence reads per run. Current sequencing platforms allow for a previously unprecedented view into complex mixtures of RNA and DNA samples. NGS is currently evolving into a molecular microscope finding its way into virtually all fields of biomedical research. We would like to examine the technical background of the different commercially available NGS platforms with respect to template generation and the sequencing reaction and take a small step toward what the upcoming NGS technologies will bring.
Malware Technology
Malware is a portmanteau of "malicious software": software intentionally designed to cause damage to a computer, server, client, or computer network. Such code is described as computer viruses, worms, Trojan horses, ransomware, spyware, adware, and scareware, among other terms.
FIGURE 9.14 A sample screen of BioEdit software as the best editor for sequence alignment. Courtesy of https://en.bio.soft.bet.
The Internet has given us the good, the bad, and the ugly. Fig. 9.15 clearly shows the impact on our societal fabric; in fact, the impact is globally ubiquitous. The most stupefying phenomenon is that within 30 years of the Internet's existence, the Dark Web, like a cancer, has come to occupy much of the Web. Although the Deep Web will not take over all of the Internet, many applications of the Surface Web will have private front ends with virtual pipelines into the Deep Web. As technology keeps producing new applications, malware hackers will be waiting on the fence, ready to develop new attack vectors and distributed payloads. The Internet iceberg is thus split into three layers.
Surface Web: This is the portion whose content average users visit on a daily basis. This is your Facebook.com, reddit.com, justice.gov, and harvard.edu.
Deep Web: Right below the surface is where the Internet iceberg meets the Deep Web. It comprises the same general hostnames as sites on the Surface Web, but extends beyond those public domains. This is the specific URL of your Facebook Messenger thread with a friend, or the Department
FIGURE 9.15 The malware Web sphere surrounds the earth with a variable contour that reflects the impact per region and country. For example, the United States has higher malware damage and risk than any other country, followed by Southeast Asia and Europe. We identified blockchain and DNA digital malware areas with different patterns. Extracted from MERIT CyberSecurity Library.
of Justice's public archival material, or Harvard's internal communications system. The Deep Web makes up most of the Internet as a whole.
Dark Web: The dwindling portion at the very bottom of the iceberg is a subset of the Deep Web that is only accessible through software that guards anonymity. Because of this, the Dark Web is home to entities that do not want to be found. To expand on that visual, it is necessary to explain that the Dark Web contains URLs that end in .onion rather than .com, .gov, or .edu. The network these .onion URLs reside on cannot be accessed with the same browser you use to access your Facebook messages, the Justice Department's archive, or your Harvard email account; you cannot simply use Chrome or Safari to reach them. The Dark Web requires a specific software program, the Onion Router (the Tor browser), to do the trick, and it offers a special layer of anonymity that the Surface Web and the Deep Web cannot. As such, the Dark Web is a place for people and activities that do not want to be found through standard means. It is complete with illegal trade markets and forums, hacking communities, private communications between journalists and whistleblowers, and more.
DNA malware
The two-way malware of binary data and DNA
The popularity of DNA storage has offered the malware lords golden opportunities to tear our happy societal fabric with new types of attacks of the highest malice. This new disruptive technology can be considered in layered terms, much like the TCP/IP protocol, as shown in Fig. 9.16. Here are two methods that attract terrorists (and bioterrorists) to binary/DNA malice like bees to honey.
Method 1: binary malware, inject into DNA
There are organized crime networks whose lords launch their criminal transactions over the Dark and Deep Web. They have established "smart" contracts with professional DNA bioinformaticians who maintain repositories of DNA malware "mines" that carry DNA sequences embedded with
FIGURE 9.16 A DNA malware professional, after implanting an exploit in DNA, stores it in a contaminated DNA sequence storage and then moves it to the Dark Web for obfuscated delivery to healthcare institutions. From MERIT CyberSecurity Library.
malicious vectors. The goal is to spread contaminated DNA "mutation" sequences in healthcare institutions, clinics, and research laboratories. Let us first understand what malware is. It is any software, written in JavaScript, Python, Java, Visual Basic, or Perl, specifically designed to harm a computer, the software it is running, or its users. Malware exhibits malicious behavior that can include installing software without user consent and installing harmful software such as viruses. It can be transmitted through the Internet as an email attachment, Trojan, backdoor, picture, or YouTube file. Binary malware is the zero-one version of the malware software; remember, computers only understand binary code. Malware such as viruses, denial-of-service payloads, ransomware, and spyware can successfully be converted into binary and smuggled into the sequencer as part of a healthy DNA sequence. The problem is that current antivirus technology has no software that can detect malware hidden inside a DNA sequence. This is a major vulnerability that hackers will use to infect healthy DNA sequences and create major havoc in DNA digital storage.
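As a sketch of what a DNA-aware scanner might look like (hypothetical; no commercial antivirus product currently offers this, and the two-bits-per-base mapping and signature list below are illustrative assumptions), a sequence can be decoded to bytes and checked against known executable magic numbers:

```python
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

# Magic numbers of common executable formats (illustrative, not exhaustive):
# "MZ" = Windows PE, b"\x7fELF" = Linux ELF, "#!/" = script shebang.
SUSPICIOUS_SIGNATURES = [b"MZ", b"\x7fELF", b"#!/"]

def dna_to_bytes(seq: str) -> bytes:
    """Decode a nucleotide string to raw bytes, 2 bits per base."""
    bits = "".join(BASE_TO_BITS[b] for b in seq if b in BASE_TO_BITS)
    usable = len(bits) - len(bits) % 8          # drop any trailing partial byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

def looks_malicious(seq: str) -> bool:
    """Flag sequences whose decoded payload contains an executable signature."""
    payload = dna_to_bytes(seq)
    return any(sig in payload for sig in SUSPICIOUS_SIGNATURES)
```

For example, the sequence "CATCCCGG" decodes to the bytes "MZ" and would be flagged, while a benign run of A's would not. A real scanner would of course need far richer heuristics, since payloads can be encrypted or split across fragments.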
Method 2: infected DNA, converted to binary
In bioinformatics and DNA computing, we can go from binary data to DNA code and return to the original data without error. This cyclic phenomenon has opened the wide gate of malware, which we explain here. Method 2 will make the devil even happier, as shown in Fig. 9.17. We create malware code on a computer and convert it into DNA code. Using the BioEdit program, we created the contaminated DNA sequence, and then the contaminated DNA sequence gets
FIGURE 9.17 Using BioEdit, professional crime bioinformaticians can modify a healthy DNA sequence with a synthetic malware sequence, create a new malware-riddled DNA code, and send it wherever the organized crime lords wish. In the next step, the cybercriminals convert the sequence to binary and launch it through the Dark Web. Furthermore, the cybercriminals could route a series of malware transactions to an adversary blockchain to disrupt the processing of authorized transactions and funnel money through other illegal methods. From MERIT CyberSecurity Library.
decoded and stored in the digital library; then, using the Dark Web, it gets issued to its destinations, such as labs, hospitals, and research institutes.
Blockchain malware
A rainstorm of grasshoppers; now we have a rainstorm of blockchain
Following the Civil War, many settlers came to Kansas in hopes of finding inexpensive land and a better life. By 1874, many of these newly arrived families had broken the prairie and planted their crops. During the spring and early summer months of that year, the state experienced sufficient rains, and the farmers eagerly looked forward to the harvest. However, during the heat of summer, a drought occurred. Yet this was not the most devastating thing to happen to the farmers that summer. The invasion began in late July when, without warning, millions of grasshoppers, or Rocky Mountain locusts, descended on the prairies from the Dakotas to Texas. The insects arrived in swarms so large they blocked out the sun and sounded like a rainstorm. They ate crops out of the ground, as well as the wool from live sheep and clothing off people's backs. Paper, tree bark, and even wooden tool handles were devoured. Hoppers were reported to have been several inches deep on the ground, and locomotives could not get traction because the insects made the rails too slippery. Now, here is an enigmatic question: How come Satoshi Nakamoto disappeared from the face of the earth, while everyone hails him as a great "Thomas Edison"? Why would the inventor of the virtual currency bitcoin not show up and claim his Nobel Prize? An Australian entrepreneur by the name of Craig Wright has claimed that he is Satoshi Nakamoto. Mr. Wright, who lives in London, showed the BBC his evidence that he launched the currency back in 2009. Very confusing indeed. Blockchain also seems to be shifting Internet malware into a higher gear. At the Black Hat Asia 2015 conference in Singapore, a statement presented by INTERPOL and researchers from the cybersecurity firm Kaspersky Lab showed that uploading malware to the blockchain would make it extremely hard to get rid of.
Indeed, there are "no methods currently available to wipe this data," according to the statement. Once a file is in the blockchain, and hence on every computer in the bitcoin network, it is there forever; for now, at least. Like the sheep-dip ritual, every public and private business organization is embarking on this new technology without even a complete feasibility study. There are over 30 companies selling blockchain-based products, and within 5 years this technology will suffer from disillusionment and failure to deliver. Malware professionals are working overtime to derail this new paradigm. Blockchain companies have to agree on some security standards to protect their customers' investments as well as their own credibility. Depending on the cryptocurrency and its protocols, there is a fixed open space on the blockchain (the public ledger of transactions) where data can be stored, referenced, or hosted within encrypted transactions and their records. It is this open space that was identified as the potential target for malware by experts, an INTERPOL officer, and a seconded specialist from Kaspersky Lab, in the Research and Innovation unit at INTERPOL's Global Complex for Innovation (IGCI). The design of the blockchain means that malware could be injected and permanently hosted, with no methods currently available to wipe these data. This could affect cyberhygiene as well as the sharing of child sexual abuse images, as the blockchain could become a haven for hosting such data. As we showed in Fig. 9.16, malware code could be sent to the private blockchain network and fused into the central ledger. Soon, the central ledger would be compromised and hijacked.
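Why embedded data cannot be wiped can be illustrated with a toy hash chain. This is a deliberately simplified model of a blockchain ledger (real networks add consensus, signatures, and Merkle trees), and the payload strings are made up:

```python
import hashlib

def block_hash(prev_hash: str, payload: str) -> str:
    """Each block's hash covers its payload AND the previous block's hash."""
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

# Build a three-block chain; the middle payload stands in for embedded malware.
chain = []
prev = "0" * 64                                  # genesis predecessor
for payload in ["tx: alice->bob",
                "embedded data (cannot be removed)",
                "tx: bob->carol"]:
    h = block_hash(prev, payload)
    chain.append({"payload": payload, "prev": prev, "hash": h})
    prev = h

# Deleting or altering the middle payload makes its recorded hash invalid,
# and every later block chains off that hash, so the whole network would
# reject the altered ledger.
original_hash = chain[1]["hash"]
assert block_hash(chain[1]["prev"], "") != original_hash
```

Because each hash commits to everything before it, removing one payload invalidates every subsequent block; the only way to "wipe" data is for the entire network to agree to rewrite history, which the consensus design is built to prevent.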
Appendices
Appendix 9.A
Fig. 9.18 is very interesting: it compares Gartner's Hype Cycle to the maturity and adoption of the Digital Immunity Ecosystem and DNA storage.
Appendix 9.B Glossary for DNA sequencing (from MERIT Cyber Library)
16S ribosomal DNA (rDNA): The 16S rRNA is a structural component of the bacterial ribosome (part of the 30S small subunit). The 16S rDNA is the gene that encodes this RNA molecule. Owing to its essential role in protein synthesis, this gene is highly conserved across all prokaryotes. There are portions of the 16S gene that are extremely highly conserved, so that a single set of "universal" PCR primers can be used to amplify a portion of this gene from nearly all prokaryotes. The gene also contains variable regions that can be used for taxonomic identification of bacteria. Amplification and taxonomic assignment of 16S rDNA sequences is a widely used method for metagenomic analysis. Algorithm: A step-by-step method for solving a problem (a recipe). In bioinformatics, it is a set of well-defined instructions for making calculations. The algorithm can then be expressed as a set of computer instructions in any software language and implemented as a program on any computer platform. Alignment: See Sequence alignment. Alignment algorithm: See Sequence alignment. Allele: In genetics, an allele is an alternative form of a gene, such as blue versus brown eye color. However, in genome sequencing, an allele is one form of a sequence variant that occurs in any position on any chromosome or a sequence variant on any sequence read aligned to the genome, regardless of its effect on phenotype or even if it is in a gene. In some cases, "allele" is used interchangeably with the term genotype. Amplicon: An amplicon is a specific fragment or locus of DNA from a target organism (or organisms), generally 200-1000 bp in length, copied millions of times by the PCR.
Amplicons for a single target (i.e., a reaction with a single pair of PCR primers) can be prepared from a mixed population of DNA templates such as HIV particles extracted from a patient's blood or total bacterial DNA isolated from a medical or an environmental sample. The resulting deep sequencing provides detailed information about the variants at the target locus across the population of different DNA templates. Amplicons produced from many different PCR primers on many different DNA samples can be combined (with the aid of multiplex barcodes) into a single DNA sequencing reaction on an NGS machine. Assemble: See Sequence assembly. Assembly: See Sequence assembly. BAM file: BAM is a binary sequence file format that uses BGZF compression and indexing. BAM is the binary compressed version of the SAM (sequence alignment/map) format, which contains information about each sequence read in an NGS data set with respect to its alignment position on a reference genome, variants in the read versus the reference genome, mapping quality, and the sequence quality string in an ASCII string that represents PHRED quality scores.
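The fields a SAM/BAM record carries can be seen by splitting the tab-separated columns of the (uncompressed) SAM form. The record below is made up for illustration:

```python
# A hypothetical single SAM record (tab-separated); BAM is its BGZF-compressed form.
sam_line = "read001\t0\tchr1\t10468\t60\t8M\t*\t0\t0\tGATCGATC\tIIIIIIII"

# The 11 mandatory SAM columns, in order.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

record = dict(zip(FIELDS, sam_line.split("\t")))
# Alignment position on the reference, mapping quality, and per-base qualities:
print(record["RNAME"], record["POS"], record["MAPQ"], record["QUAL"])
```

In practice, BAM files are read with tools such as samtools or the pysam library rather than by hand, but the underlying fields are the same.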
FIGURE 9.18 The hype cycle is a branded graphical presentation developed and used by the American research, advisory, and information technology firm Gartner to represent the maturity, adoption, and social application of specific technologies. The hype cycle provides a graphical and conceptual presentation of the maturity of emerging technologies through five phases.
BED file: BED is an extremely simple text file format that lists positions on a reference genome with respect to chromosome ID and start and stop positions. NGS reads can be represented in BED format, but only with respect to their position on the reference genome; no information about sequence variants or base quality is stored in the BED file. BLAST: The basic local alignment search tool (BLAST) was developed by Altschul and other bioinformaticians at the NCBI to provide an efficient method for scientists to use similarity-based searching to locate sequences in the GenBank database. BLAST uses a heuristic algorithm based on a hash table of the database to accelerate similarity searches, but it is not guaranteed to find the optimal alignment between any two sequences. BLAST is generally considered to be the most widely used bioinformatics software. Burrows-Wheeler transformation (BWT): BWT is a method of indexing (and compressing) a reference genome into a graph data structure of overlapping substrings, known as a suffix tree. It requires a single computational effort to build this graph for a particular reference genome; then it can be stored and reused when mapping multiple NGS data sets to this genome. The BWT method is particularly efficient when the data contain runs of repeated sequences, as in eukaryotic genomes, because it reduces the complexity of the genome by collapsing all copies of repeated strings. BWT works well for alignment of NGS reads to a reference genome because the sequence reads generally match the reference perfectly or with only a few mismatches. BWT methods work poorly when many mismatches and indels are present in the reads, because many alternate paths through the suffix tree must be mapped. Highly cited NGS alignment software that makes use of BWT includes BWA, Bowtie, and SOAP2. Capillary DNA sequencing: This is a method used in DNA sequencing machines manufactured by Life Technologies Applied Biosystems.
The technology is a modification of Sanger sequencing that contains several innovations: the use of fluorescent-labeled dye terminators (or dye primers), cycle sequencing chemistry, and electrophoresis of each sample in a single capillary tube containing a polyacrylamide gel. High voltage is applied to the capillaries, causing the DNA fragments produced by the cycle sequencing reaction to move through the polymer and separate by size. Fragment sizes are determined by a fluorescent detector, and the bases that comprise the sequence of each sample are called automatically. ChIP-seq: Chromatin immunoprecipitation sequencing uses NGS to identify fragments of DNA bound by specific proteins such as transcription factors and modified histone subunits. Tissue samples or cultured cells are treated with formaldehyde, which creates covalent cross-links between DNA and associated proteins. The DNA is purified and fragmented into short segments of 200-300 bp and then immunoprecipitated with a specific antibody. The cross-links are removed, and the DNA segments are sequenced on an NGS machine (usually Illumina). The sequence reads are aligned to a reference genome, and protein-binding sites are identified as sites on the genome with clusters of aligned reads. Cloning: In the context of DNA sequencing, DNA cloning refers to the isolation of a single purified fragment of DNA from the genome of a target organism and the production of millions of copies of this DNA fragment. The fragment is usually inserted into a cloning vector, such as a plasmid, to form a recombinant DNA molecule, which can then be amplified in bacterial cells. Cloning requires significant time and hands-on laboratory work and creates a bottleneck for traditional Sanger sequencing projects. Consensus sequence: When two or more DNA sequences are aligned, the overlapping portions can be combined to create a single consensus sequence.
In positions where all overlapping sequences have the same base (a single column of the multiple alignment), that base becomes the consensus. Various
rules may be used to generate the consensus for positions where there are disagreements among overlapping sequences. A simple majority rule uses the most common letter in the column as the consensus. Any position where there is disagreement among aligned bases can be written as the letter N to designate “unknown.” There is also a set of IUPAC ambiguity codes (YRWSKMDVHB) that can be used to specify specific sets of different DNA bases that may occupy a single position in the consensus. Contig: A contiguous stretch of DNA sequence that is the result of assembly of multiple overlapping sequence reads into a single consensus sequence. A contig requires a complete tiling set of overlapping sequence reads spanning a genomic region without gaps. Coverage: The number of sequence reads in a sequencing project that align to positions that overlap a specific base on a target genome, or the average number of aligned reads that overlap all positions on the target genome. De Bruijn graph: This is a graph theory method for assembling a long sequence (like a genome) from overlapping fragments (like sequence reads). The de Bruijn graph is a set of unique substrings (words) of a fixed length (a k-mer) that contain all possible words in the data set exactly once. For genome assembly, the sequence reads are split into all possible k-mers, and overlapping k-mers are linked by edges in the graph. Reads are then mapped onto the graph of overlapping k-mers in a single pass, greatly reducing the computational complexity of genome assembly. De novo assembly: See De novo sequencing. De novo sequencing: The sequencing of the genome of a new, previously unsequenced organism or DNA segment. This term is also used whenever a genome (or sequence data set) is assembled by methods of sequence overlap without the use of a known reference sequence. De novo sequencing might be used for a region of a known genome that has significant mutations and/or structural variation from the reference. 
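The k-mer decomposition described under "De Bruijn graph" above can be sketched in a few lines. This is a toy construction with a made-up three-read example; real assemblers such as Velvet or SPAdes add error correction and extensive graph simplification:

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Link each k-mer's (k-1)-length prefix to its (k-1)-length suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix node -> suffix node
    return graph

# Three overlapping reads drawn from the toy genome "GATTACA", split into 3-mers.
graph = build_de_bruijn(["GATTA", "ATTAC", "TTACA"], k=3)
print(dict(graph))
```

Note that the overlapping reads contribute the same k-mers only once each, so the graph stays small no matter how many reads cover a region; walking the single path GA → AT → TT → TA → AC → CA spells out the original "GATTACA".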
Diploid: A cell or organism that contains two copies of every chromosome, one inherited from each parent. DNA fragment: A small piece of DNA, often produced by a physical or chemical shearing of larger DNA molecules. NGS machines determine the sequence of many DNA fragments simultaneously. Exon: A portion of a gene that is transcribed and spliced to form the final messenger RNA (mRNA). Exons contain protein-coding sequence and untranslated upstream and downstream regions (3′ UTR and 5′ UTR). Exons are separated by introns, which are sequences that are transcribed by RNA polymerase, but spliced out after transcription and not included in the mature mRNA. FASTA format: This is a simple text format for DNA and protein sequence files developed by William Pearson in conjunction with his FASTA alignment software. The file has a single header line that begins with a ">" symbol followed by a sequence identifier. Any other text on the first line is also considered the header, and any text following the first carriage return/line feed is considered part of the sequence. Multiple sequences can be stored in the same text file by adding additional header lines and sequences after the end of the first sequence. FASTQ file: A text file format for NGS reads that contains both the DNA sequence and quality information about each base. Each sequence read is represented as a header line with a unique identifier for each sequence read and a line of DNA bases represented as text (GATC), which is very similar to the FASTA format. A second pair of lines is also present for each read, another header line and then a line with a string of ASCII symbols, equal in length to the number of bases in the read, which encode the PHRED quality score for each base.
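The FASTQ quality encoding can be decoded directly: in the standard Sanger/Illumina 1.8+ convention, each character's ASCII code minus 33 gives the PHRED score, and a score Q corresponds to an error probability of 10^(-Q/10). A sketch with a made-up four-line record:

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Sanger/Illumina 1.8+ uses ASCII offset 33)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """PHRED score Q corresponds to a base-call error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Hypothetical 4-line FASTQ record: header, bases, separator, qualities.
record = ["@read001", "GATC", "+", "II#I"]
scores = phred_scores(record[3])
print(scores)          # 'I' decodes to Q40 (1-in-10,000 error), '#' to Q2
```

Here the third base has quality Q2 (roughly a 63% chance of being wrong), which is why variant callers typically discard or down-weight such low-quality positions.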
Fragment assembly: To determine the complete sequence of a genome or large DNA fragment, short sequence reads must be merged. In Sanger sequencing projects, overlaps between sequence reads are found and aligned by similarity methods; then consensus sequences are generated and used to create contigs. Eventually a complete tiling of contigs is assembled across the target DNA. In NGS, there are too many sequence reads to search for overlaps among them all (a problem with exponential complexity). Alternate algorithms have been developed for de novo assembly of NGS reads, such as de Bruijn digraphs, which map all reads to a common matrix of short k-mer sequences (a problem with linear complexity). GenBank: The international archive of DNA and protein sequence data maintained by the National Center for Biotechnology Information (NCBI), a division of the US National Library of Medicine. GenBank is part of a larger set of online scientific databases maintained by the NCBI, which includes the PubMed online database of published scientific literature, gene expression, sequence variants, taxonomy, chemicals, human genetics, and many software tools to work with these data. Heterozygote: Humans and most other eukaryotes are diploid, meaning that they carry two copies of each chromosome in every somatic cell. Therefore, each individual carries two copies of each gene, one inherited from each parent. If the two copies of the gene are different (i.e., different alleles of that gene), then the person is said to be a heterozygote for that gene. A homozygote has two identical copies of that gene. In genome sequencing, every base of every chromosome can be considered as a separate data point; thus, any single base can be genotyped as heterozygous or homozygous in that individual. High-performance computing (HPC) (Chapter 12): HPC provides computational resources to enable work on challenging problems that are beyond the capacity and capability of desktop computing resources. 
Such large resources include powerful supercomputers with massive numbers of processing cores that can be used to run high-end parallel applications. HPC designs are heterogeneous but generally include multicore processors, multiple CPUs within a single computing device or node, graphics processing units (GPUs), and multiple nodes grouped in a cluster interconnected by high-speed networking systems. The most powerful current supercomputers can perform several quadrillion (10^15) operations per second (petaflops). Trends for supercomputing architecture are for greater miniaturization of parallel processing units, which saves energy (and reduces heat), speeds message passing, and allows for access to data in shared memory caches. Histone: In eukaryotic cells, the DNA in chromosomes is organized and protected by wrapping around a set of scaffold proteins called histones. Histones are composed of six different proteins (H1, H2A, H2B, H3, H4, and H5). Two copies of each histone bind together to form a spool structure. DNA winds around the histone core about 1.65 times, using a length of 147 bp to form a unit known as the nucleosome. Methylation and other modifications of the histone proteins affect the structure and function of DNA (epigenetics). Human Genome Project (HGP): An international effort including 20 sequencing centers in China, France, Germany, Great Britain, Japan, and the United States, coordinated by the US Department of Energy and the National Institutes of Health, to sequence the entire human genome. The effort formally began in 1990 with the allocation of funds by Congress and the development of high-resolution genetic maps of all human chromosomes. The project was formally completed in two stages: the "working draft" genome in 2000 and the "finished" genome in 2003.
The 2003 version of the genome was declared to have fewer than one error per 10,000 bases (99.99% accuracy), an average contig size of >27 million bases, and to cover 99% of the gene-containing regions of all chromosomes. In addition, the HGP was responsible for large improvements in DNA sequencing technology,
mapping more than 3 million human SNPs, and genome sequences for Escherichia coli, fruit fly, and other model organisms. Human Microbiome Project: An effort coordinated by the US National Institutes of Health to profile microbes (bacteria and viruses) associated with the human body: first to inventory the microbes present at various locations inside and outside the body and the normal range of variation in healthy people, and then to investigate changes in these microbial populations associated with disease. Illumina sequencing: The NGS sequencing method developed by the Solexa company and then acquired by Illumina Inc. This method uses "sequencing by synthesis" chemistry to simultaneously sequence millions of ~300-bp-long DNA template molecules. Many sample preparation protocols are supported by Illumina including whole-genome sequencing (by random shearing of genomic DNA), RNA sequencing, and sequencing of fragments captured by hybridization to specific oligonucleotide baits. Illumina has aggressively improved its system through many updates, at each stage generally providing the highest total yield and greatest yield of sequence per dollar of commercially available DNA sequencers each year, leading to a dominant share of the NGS market. Machines sold by Illumina include the Genome Analyzer (GA, GAII, GAIIx), HiSeq, and MiSeq. At various times, with various protocols, Illumina machines have produced NGS reads of 25, 36, 50, 75, 100, and 150 bp as well as paired-end reads. Indels: Insertions or deletions in one DNA sequence with respect to another. Indels may be a product of errors in DNA sequencing, the result of alignment errors, or true mutations in one sequence with respect to another, such as mutations in the DNA of one patient with respect to the reference genome. In the context of NGS, indels are detected in sequence reads after alignment to a reference genome.
Indels are called in a sample (i.e., a patient's genome) after variant detection has established a high probability that the indel is present in multiple reads with adequate coverage and quality, and not the result of errors in sequencing or alignment. Intron: A portion of a gene that is spliced out of the primary transcript of a gene and not included in the final messenger RNA (mRNA). Introns separate exons, which contain the protein-coding portions of a gene. ktup, k-tuple, or k-mer: A short word composed of DNA symbols (GATC) that is used as an element of an algorithm. A sequence read can be broken down into shorter segments of text (either overlapping or nonoverlapping words). The length of the word is called the ktup size. Very fast exact matching methods can be used to find words that are shared by multiple sequence reads or between sequence reads and a reference genome. Word matching methods can use hash tables and other data structures that can be manipulated much more efficiently by computer software than sequence reads represented by long text strings. Mate-pair sequencing: Mate-pair sequencing is similar to paired-end sequencing; however, the size of the DNA fragments used as sequencing templates is much longer (1000-10,000 bp). To accommodate these long-template fragments on NGS platforms such as Illumina, additional sample preparation steps are required. Linkers are added to the ends of the long fragments, and then the fragments are circularized. The circular molecules are then sheared to generate new DNA fragments at an appropriate size for construction of sequencing libraries (200-300 bp). From this set of sheared fragments, only those fragments containing the added linkers are selected. These selected fragments contain both ends of the original long fragment. New primers are added to both ends, and standard paired-end sequencing is performed.
The orientation of the paired sequence reads after mapping to the genome is opposite from a standard paired-end method (outward facing rather than inward facing).
290
Chapter 9 Decoding back to binary
Mate-pair methods are particularly valuable for joining contigs in de novo sequencing and for detecting translocations and large deletions (structural variants).

Metagenomics: The study of complete microbial populations in environmental and medical samples. Often conducted as a taxonomic survey using direct PCR (with universal 16S primers) of DNA extracted from environmental samples. Shotgun metagenomics sequences all DNA in these samples and then attempts both taxonomic and functional identification of genes encoded by microbial DNA.

Microarray: A collection of specific oligonucleotide probes organized in a grid pattern of microscopic spots attached to a solid surface, such as a glass slide. The probes contain sequences from known genes. Microarrays are generally used to study gene expression by hybridizing labeled RNA extracted from an experimental sample to the array and then measuring the intensity of signal in each spot. Microarrays can also be used for genotyping by creating an array of probes that match alternate alleles of specific sequence variants.

Multiple alignment: A computational method that lines up, as a set of rows of text, three or more sequences (of DNA, RNA, or proteins) to maximize the identity of overlapping positions while minimizing mismatches and gaps. The resulting set of aligned sequences is also known as a multiple alignment. Multiple alignments may be used to study evolutionary information about the conservation of bases at specific positions in the same gene across different organisms or about the conservation of regulatory motifs across a set of genes. In NGS, multiple alignment methods are used to reduce a set of overlapping reads that have been mapped to a region of a reference genome by pairwise alignment to a single consensus sequence, and also to aid in the de novo assembly of novel genomes from sets of overlapping reads created by fragment assembly methods.
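The reduction of overlapping reads to a single consensus sequence can be illustrated with a toy Python sketch. The reads below are invented for illustration and assumed to be already stacked into alignment columns, with "-" padding the positions a read does not cover:

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column over a stack of equal-length aligned reads.
    '-' marks positions a read does not cover and never casts a vote."""
    result = []
    for col in range(len(aligned_reads[0])):
        votes = Counter(read[col] for read in aligned_reads if read[col] != "-")
        result.append(votes.most_common(1)[0][0])  # most frequent real base
    return "".join(result)

reads = [
    "ACGTAC--",   # read 1 covers columns 0-5
    "-CGTACGT",   # read 2 covers columns 1-7
    "ACGAACGT",   # read 3 has a sequencing error at column 3 (A, not T)
]
print(consensus(reads))  # -> ACGTACGT: the error is outvoted 2-to-1
```

Real consensus callers also weigh base quality scores and coverage depth, but the column-wise vote is the core idea.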
Next-generation (DNA) sequencing (NGS): DNA sequencing technologies that simultaneously determine the sequence of DNA bases from many thousands (or millions) of DNA templates in a single biochemical reaction volume. Each template molecule is affixed to a solid surface in a spatially separate location and then amplified to increase signal strength. The sequences of all templates are determined in parallel by the addition of complementary nucleotide bases to a sequencing primer, coupled with signal detection from this event.

Paired-end read: See Paired-end sequencing.

Paired-end sequencing: A technology that obtains sequence reads from both ends of a DNA fragment template. The use of paired-end sequencing can greatly improve de novo sequencing applications by allowing contigs to be joined when they contain read pairs from a single template fragment, even if no reads overlap. Paired-end sequencing can also improve the mapping of reads to a reference genome in regions of repetitive DNA (and detection of sequence variants in those locations). If one read contains repetitive sequence, but the other maps to a unique genome position, then both reads can be mapped.

Phred score: The Phred software was developed by Phil Green and coworkers working on the Human Genome Project to improve the accuracy of base calling on ABI sequencers (using fluorescent Sanger chemistry). Phred assigns a quality score to each base, which is equivalent to the probability of error for that base. The Phred score is −10 times the log (base 10) of the error probability; thus a base with an accuracy of 99% (an error probability of 0.01) receives a Phred score of 20. Phred scores have been adopted as the measure of sequence quality by all NGS manufacturers, although the estimation of error probability is done in many ways (in some cases with questionable validity).
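The Phred relationship is simple enough to verify in a few lines of Python (the sample quality string is invented for illustration):

```python
import math

def phred(error_probability):
    """Phred quality score: Q = -10 * log10(P_error)."""
    return -10 * math.log10(error_probability)

def error_prob(q):
    """Inverse mapping: P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

print(phred(0.01))     # 99% accuracy -> Q = 20.0
print(error_prob(30))  # Q30 -> 0.001 (1 error in 1000 bases)

# FASTQ files store one quality character per base; in the common
# Sanger/Illumina 1.8+ encoding the character is chr(Q + 33).
quality_string = "II?5"
print([ord(ch) - 33 for ch in quality_string])  # -> [40, 40, 30, 20]
```

The logarithmic scale is what makes the scores convenient: each 10-point increase means a tenfold drop in the error probability.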
Appendices
291
Poisson distribution: A random probability distribution in which the mean is equal to the variance. This distribution describes rare events that occur with equal probability across an interval of time or space. In NGS, sequence reads obtained from sheared genomic DNA are often assumed to be Poisson-distributed across the genome.

Pyrosequencing: A method of DNA sequencing developed in 1996 by Nyrén and colleagues that directly detects the addition of each nucleotide base as a template is copied. The method detects light emitted by a chemiluminescent reaction driven by the pyrophosphate that is released as the nucleotide triphosphate is covalently linked to the growing copy strand. Each type of base is added in a separate reaction mix, but terminators are not used; thus, a series of identical bases (a homopolymer) creates multiple covalent linkages and a brighter light emission. This chemistry is used in the Roche 454 sequencing machines.

Reference genome: A curated consensus sequence for all of the DNA in the genome (all of the chromosomes) of a species of organism. Because the reference genome is created as the synthesis of a variety of different data sources, it may occasionally be updated; thus, a particular instance of that reference is referred to by a version number.

Reference sequence: The formally recognized, official sequence of a known genome, gene, or artificial DNA construct. A reference sequence is usually stored in a public database and may be referred to by an accession number or other shortcut designation, such as human genome hg19. An experimentally determined sequence produced by an NGS machine may be aligned and compared with a reference sequence (if one exists) to assess accuracy and to find mutations.

Repetitive DNA: DNA sequences that are found in identical duplicates many times in the genome of an organism. Some repetitive DNA elements are found in genomic features such as centromeres and telomeres with important biological properties.
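The Poisson assumption has a practical consequence: at mean coverage c, the expected fraction of genome positions covered by exactly k reads is e^(-c) * c^k / k!, so the uncovered fraction is e^(-c). A quick numerical check (the coverage value of 30 is illustrative):

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution: e^-m * m^k / k!"""
    return math.exp(-mean) * mean ** k / math.factorial(k)

coverage = 30  # illustrative mean read depth
# Fraction of genome positions with no reads at all: e^-30, about 9.4e-14.
print(f"uncovered fraction: {poisson_pmf(0, coverage):.2e}")
# Fraction covered too thinly for confident variant calls (say, <= 5 reads).
print(f"depth <= 5: {sum(poisson_pmf(k, coverage) for k in range(6)):.2e}")
```

In practice coverage is somewhat more uneven than Poisson (GC bias, repeats), so real pipelines treat these numbers as optimistic lower bounds.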
Other repetitive elements, such as transposons, are similar to viruses that copy themselves into many locations on the genome. Simple sequence repeats are another type of repetitive element, composed of linear repeats of 1-, 2-, or 3-base patterns such as CAGCAGCAGCAG... A short sequence read that contains only repetitive sequence may align to many different genomic locations, which creates problems with de novo assembly, mapping of sequence fragments to a reference genome, and many related applications.

Ribosomal DNA (rDNA): Genes that code for ribosomal RNA (rRNA) are present in multiple copies in the genomes of all eukaryotes. In most eukaryotes, the rDNA genes are present in identical tandem repeats that contain the coding sequences for the 18S, 5.8S, and 28S rRNA genes. In humans, a total of 300–400 rDNA repeats are located in regions on chromosomes 13, 14, 15, 21, and 22. These regions form the nucleolus. Additional tandem repeats of the coding sequence for the 5S rRNA are located separately. rRNA is a structural component of ribosomes and is not translated into protein. The rRNA genes are highly transcribed, contributing >80% of the total RNA found in cells. RNA sequencing methods generally include purification steps to remove rRNA or to enrich mRNA from protein-coding genes.

Ribosomal RNA (rRNA): See Ribosomal DNA.

RNA-seq: The sequencing of cellular RNA, usually used as a method to measure gene expression but also used to detect sequence variants in transcribed genes, alternative splicing, gene fusions, and allele-specific expression. For novel genomes, RNA-seq can be used as experimental evidence to identify expressed regions (coding sequences) and to map exons onto contigs and scaffolds.

Roche 454 Genome Sequencer: DNA sequencers developed in 2004 by 454 Life Sciences (subsequently purchased by Roche) were the first commercially available machines that used massively parallel sequencing of many templates at once. These "next-generation sequencing" (NGS) machines increased the output (and reduced the cost) of DNA sequencing by at least three orders of magnitude over sequencing methods that used Sanger chemistry, but produced shorter sequence reads. 454 machines use beads to isolate individual template molecules and an emulsion PCR system to amplify these templates in situ, and then perform the sequencing reactions in a flow cell that contains millions of tiny wells, each of which fits exactly one bead. 454 uses pyrosequencing chemistry, which has very few base substitution errors but a tendency to produce insertion/deletion errors in stretches of homopolymer DNA.

SAM/BAM: See BAM file.

Sanger sequencing method: The method developed by Frederick Sanger in 1975 to determine the nucleotide sequence of cloned, purified DNA fragments. The method requires that DNA be denatured into single strands; then a short oligonucleotide sequencing primer is annealed to one strand, and DNA polymerase enzyme extends the primer, adding new complementary deoxynucleotides one at a time, creating a copy of the strand. A small amount of a dideoxynucleotide is included in the reaction, which causes the polymerase to terminate, creating truncated copies. In a reaction with a single type of dideoxynucleotide, all fragments of a specific size will end with the same base. Four separate reactions containing a single dideoxynucleotide (ddG, ddA, ddT, and ddC) must be conducted, and then all four reactions are run on four adjacent lanes of a polyacrylamide gel. The actual sequence is determined from the length of the fragments, which correspond to the position where a dideoxynucleotide was incorporated.

Sequence alignment: An algorithmic approach to find the best matching of consecutive letters in one sequence (text symbols that represent the polymer subunits of DNA or protein sequences) with another.
Generally, sequence alignment methods balance gaps with mismatches, and the relative scoring of these two features can be adjusted by the user.

Sequence assembly: A computational process of finding overlaps of identical (or nearly identical) strings of letters among a set of sequence fragments and iteratively joining them together to form longer sequences.

Sequence fragment: A short string of text that represents a portion of a DNA (or RNA) sequence. NGS machines produce short reads that are sequence fragments read from DNA fragments.

Sequence read, short read: When DNA sequence is obtained by any experimental method, including both Sanger and next-generation methods, the data are obtained from individual template molecules as a string of nucleotide bases (represented by the letter symbols G, A, T, C). This string of letters is called a sequence read. The length of a sequence read is determined by the technology. Sanger reads are typically 500–800 bases long, Roche 454 reads 200–400 bases, and Illumina reads may be 25–200 bases (depending on the model of machine, reagent kit, and other variables). Sequence reads produced by NGS machines are often called short reads.

Sequence variants: Differences at specific positions between two aligned sequences. Variants include single-nucleotide polymorphisms (SNPs), insertions and deletions, copy number variants, and structural rearrangements. In NGS, variants are found after alignment of sequence reads to a reference genome. A variant may be observed as a single mismatched base in a single sequence read, or it may be confirmed by variant detection software from multiple sources of data.

Sequencing primer: A short single-stranded oligonucleotide that is complementary to the beginning of a fragment of DNA that will be sequenced (the template). During sequencing, the primer anneals to the template DNA, and then DNA polymerase enzyme adds additional nucleotides that extend the primer, forming a new strand of DNA complementary to the template molecule. DNA polymerase cannot synthesize new DNA without a primer. In traditional Sanger sequencing, the sequencing primer is complementary to the plasmid vector used for cloning; in NGS, the primer is complementary to a linker that is ligated to the ends of template DNA fragments.

Sequencing by synthesis: This is the term used by Illumina to describe the chemistry used in its NGS machines (Illumina Genome Analyzer, HiSeq, MiSeq). The biochemistry involves a single-stranded template molecule, a sequencing primer, and DNA polymerase, which adds nucleotides one by one to a DNA strand complementary to the template. Nucleotides are added to the templates in separate reaction mixes for each type of base (GATC), and each synthesis reaction is accompanied by the emission of light, which is detected by a camera. Each nucleotide is modified with a reversible terminator, so that only one nucleotide can be added to each template. After a cycle of four reactions adding just one G, A, T, or C base to each template, the terminators are removed so that another base can be added to all templates. This cycle of synthesis with each of the four bases and removal of terminators is repeated to achieve the desired read length.

SFF file: Standard flowgram format is a file type developed by Roche 454 for the sequencing data produced by the NGS machine. The SFF file contains both sequence and quality information about each base. The format was initially proprietary but has been standardized and made public in collaboration with the international sequence databases. SFF is a binary format and requires custom software to read it or convert it to human-readable text formats.

Shotgun sequencing: A strategy for sequencing novel or unknown DNA. Many copies of the target DNA are sheared into random fragments, and then primers are added to the ends of these fragments to create a sequencing library.
The library is sequenced by high-throughput methods to generate a large number of DNA sequence reads that are randomly sampled from the original target. The target DNA is reconstructed using a sequence assembly algorithm that finds overlaps between the sequence reads. This method may be applied to small sequences, such as cosmid and BAC clones, or to entire genomes.

Smith–Waterman alignment: A rigorous optimal alignment method for two sequences based on dynamic programming. This method always finds the optimal alignment between two sequences, but it is slow and very computationally demanding because it computes a matrix of all possible alignments with all possible gaps and mismatches. The size of this matrix increases with the square of the lengths of the sequences to be aligned, and it requires huge amounts of memory and CPU time to work with genome-sized sequences.

SOLiD sequencing: The Applied Biosystems division of Life Technologies Inc. purchased the SOLiD (Supported Oligo Ligation Detection) technology from the biotech company Agencourt Personal Genomics and released the first commercial version of this NGS machine in 2007. The technology is fundamentally different from any other Sanger or NGS method in that it uses ligation of short fluorescently labeled oligonucleotides to a sequencing primer, rather than DNA polymerase, to copy a DNA template. Sequences are detected two bases at a time, and then base calls are made based on two overlapping oligos. Raw data files use a "color space" system that is different from the base calls produced by all other sequencing systems and requires different informatics software. This system has some interesting built-in error correction algorithms but has failed to show superior overall accuracy in the hands of customers. The yield of the system is similar to that of Illumina NGS machines.

Variant detection: NGS is frequently used to identify mutations in DNA samples from individual patients or experimental organisms.
Sequencing can be done at the whole-genome scale; with RNA-seq, which targets expressed genes; with exome capture, which targets specific exon regions captured by hybridization to probes of known sequence; or with amplicons for genes or regions of interest. In all cases, sequence variants are detected by alignment of NGS reads to a reference sequence and then identification of differences between the reads and the reference. Variant detection algorithms must distinguish between random sequencing errors, differences caused by incorrect alignment, and true variants in the genome of the target organism. Various combinations of base quality scores, alignment quality scores, depth of coverage, variant allele frequency, and the presence of nearby sequence variants and indels are used to differentiate true variants from false positives. Recent algorithms have also made use of machine learning methods based on training sets of genotype data or on large sets of samples from different patients/organisms that are sequenced in parallel with the same sample preparation methods on the same NGS machines.
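The quadratic dynamic-programming matrix behind the Smith–Waterman method described in the glossary can be sketched in a few lines of Python. The scoring values (+2 for a match, −1 for a mismatch, −2 for a gap) are illustrative rather than the parameters of any particular tool, and only the best local score is computed, without the traceback:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score via dynamic programming.
    Fills an (len(a)+1) x (len(b)+1) matrix; time and memory grow with
    len(a) * len(b), which is why the exact method is impractical for
    genome-sized inputs."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are never allowed to go below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Seven matches and one gap: 7 * 2 + 1 * (-2) = 12.
print(smith_waterman("ACGTTACG", "ACGTACG"))  # -> 12
```

Production aligners avoid this cost by seeding with fast exact word matches (the ktup idea above) and running the dynamic program only on small candidate regions.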
Suggested readings

Bitcoin Malware. https://www.csoonline.com/article/3269871/security/bitcoin-network-3-to-10-times-more-evil-than-the-rest-of-the-internet.html.
DNA Data Storage Future. http://www.potomacinstitute.org/images/studies/Future_of_DNA_Data_Storage.pdf.
Encryption Based on DNA. http://www.serialsjournals.com/serialjournalmanager/pdf/1481105391.pdf.
Topol, Eric, March 12, 2019. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again, first ed. Basic Books.
Google Storage. Read more at: http://hyperphysics.phy-astr.gsu.edu/hbase/Organic/transcription.html.
How to Build mRNA. http://hyperphysics.phy-astr.gsu.edu/hbase/Organic/transcription.html.
MIST. https://www.iarpa.gov/index.php/research-programs/mist?id=1077.
Next Generation of Sequencing. https://ac.els-cdn.com/S092544391400180X/1-s2.0-S092544391400180X-main.pdf?_tid=e02055f6-3369-48dd-b098-6e4419e51f1d&acdnat=1546935292_fbb65513ae7380c47e605e7014e93dab.
Part 1 Genecoin. https://medium.com/@bernardoagmonte/let-me-tell-you-the-story-of-genecoin-ffe03c23104b.
Part 2 Genecoin. https://medium.com/@bernardoagmonte/why-genecoin-53ed18511049.
Ramakrishnan, V., November 6, 2018. Gene Machine: The Race to Decipher the Secrets of the Ribosome, first ed. Basic Books.
Ridley, M., May 30, 2006. Genome: The Autobiography of a Species, reprint ed. Harper Perennial.
Start-Stop Codon. https://www.news-medical.net/life-sciences/START-and-STOP-Codons.aspx.
The Bitcoin Magazine. https://bitcoinmagazine.com/articles/genecoin-dna-for-the-blockchain-1415660431/.
The Deep Web. https://darkwebnews.com/help-advice/access-dark-web/.
US intelligence researching DNA for exabytes of data storage. https://sociable.co/technology/dna-data-storage/.
Watson and Crick. https://www.exploratorium.edu/origins/coldspring/printit.html.
Watson, J.D., Berry, A., April 1, 2003. DNA: The Secret of Life, first ed. Knopf.
CHAPTER 10

Fusing DNA with digital immunity ecosystem
How did human immunity come about?

What if we could travel through the microscopic world and enter the amazing environment of a single, solitary, minuscule human cell, just one of the trillions in your body? What an amazing world of wonder we would encounter! There, within this infinitesimally small and elegantly ordered domain, we would see complicated molecular machinery busily carrying out the functions that make our lives on earth possible! Traveling to the heart of the cell, its nucleus, we would find the "brains" of this invisible, unfathomably small world: an incredibly thin, unbelievably long strand of atoms that form a molecule, the existence of which is truly a miracle.
– Wallace G. Smith, Tomorrow's World, May 2013.
I believe things like DNA computing will eventually lead the way to a "molecular revolution," which ultimately will have a very dramatic effect on the world.
– L. Adleman.
Plague at the Siege of Caffa, 1346

The Black Death spread in Europe in a very mysterious way, and we can blame the Mongols and their terror tactics during their siege of Caffa. The siege of Caffa has historical significance because it transmitted the plague to Europe. The dying Tartars, stunned and stupefied by the immensity of the disaster brought about by the Black Death and realizing that they had no hope of escape, lost interest in the siege of the city. They ordered infected corpses to be placed in catapults and lobbed into the city in the hope that the intolerable stench would kill everyone inside. As soon as the rotting corpses tainted the air and poisoned the water supply, and the stench became overwhelming, the citizens of Caffa knew they could not hide, flee, or escape. No one knew, or could discover, a means of defense. Imagine being besieged for months and then having rotting corpses flung at you. Because the attackers chose the strongest-smelling corpses, it is likely that when they were hurled, they came down hard and very sticky. The landing of the corpses drew rats from within the fortress, and it was the beginning of the outbreak. Eventually, the city of Caffa fell to the Mongol invaders. The battle of Caffa was documented in Arabic, as shown in Fig. 10.1.

FIGURE 10.1
The invading Tatar army of the Golden Horde suffered an outbreak of the bubonic plague during its siege of the city of Caffa (now called Theodosia), a port and resort on the Black Sea coast. The Tartars came up with a devilish idea to devastate their enemy: flinging the plague-stricken corpses of their fallen brethren over the city walls to purposely infect the defenders. The plan worked, and the inhabitants of Caffa were forced to surrender their city to the Mongol invaders. It is believed that some of the survivors of the initial attack left Caffa for Constantinople and other ports in the Mediterranean, which contributed to the pandemic known as the Black Death.

Following the miasma theory of contagion, infection is passed through the contamination of the air, water, and food supply. The intolerable stench would kill everyone inside: the contagion is in the foul-smelling air, which can also permeate and contaminate food or water, and it could be transmitted person to person. Without understanding germ theory, people could not know what was actually being passed, but they did recognize respiratory and oral routes of transmission. Using corpses to foul land and water is a method of biological warfare that is probably as old as warfare itself.

Among those who escaped from Caffa by boat were a few sailors who had been infected with the poisonous disease. Some boats were bound for Genoa; others for Venice and other Christian areas. When the sailors reached these places and mixed with the people there, it was as if they had brought evil spirits with them: every city, every settlement, and their inhabitants, both men and women, died suddenly. Of 1000 sailors, barely 10 survived, and relatives, kinsmen, and neighbors flocked to them from all sides. But the survivors were carrying the darts of death: while they hugged and kissed, they were spreading poison from their lips even as they spoke.

Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00010-0 Copyright © 2020 Elsevier Inc. All rights reserved.
The plague of Athens, 430 BC

Throughout history, we find that humanity's most worrisome fear was sickness. Once a virus attacked the body, the fight was between God and the Devil. Longevity was a mysterious and unexplainable quandary, and immunity was unknown, like a touch of luck. During the plague of Athens in 430 BCE, the historian Thucydides noted that people who had recovered from a previous bout of the disease could nurse the sick without contracting the illness a second time (Fig. 10.2 depicts the misery of man and his vulnerability without the magic of immunity). It took 2287 years before Louis Pasteur, in 1857, developed "acquired immunity" through vaccination. Without acquired immunity, the human race would have been in serious jeopardy.

FIGURE 10.2
The Greeks realized that the people who had previously survived the plague did not contract the disease a second time.
DNA digital storage meets the Digital Immunity Ecosystem

The Digital Immunity Ecosystem (DIE) is the smart platform designed to protect DNA digital storage and the critical systems in our smart cities. We purposely used the term "ecosystem" to include all the other exogenous components of DIE, such as the smart city, smart grids, cloud, satellites, and knowledge engines. Figs. 10.3 and 10.4 give the reader a better perspective on the architecture of DIE.
FIGURE 10.3
The Digital Immunity Ecosystem (DIE) is a system of immense complexity, with the massive responsibility of guarding, defending, and immunizing all the critical systems on the city smart grid. DNA storage will be riveted to the DIE like a permanent fastener, securely managing the storage of all the information processed by it. Copyright © 2018 MERIT CyberSecurity Group; all rights reserved.
FIGURE 10.4
This 3D isometric diagram shows the smart components of DIE, including the DNA digital storage. The system is more than the sum of its parts. Each component has the ultimate intelligence and learning ability and instantaneously responds to incoming commands. Sometimes a component responds on its own, using its artificial intelligence (AI) judgment built on deep learning. DIE, Digital Immunity Ecosystem.
Anatomy of Digital Immunity Ecosystem and its intelligent components

The DIE has an interesting technical alias, the Cognitive Early Warning Predictive System (CEWPS), which we often use throughout the book. CEWPS focuses on the components of the system, whereas the DIE represents the system and the environment of the smart city. It is designed specifically to defend, by offense, smart cities and large metropolitan areas supported by critical infrastructures. DIE is an arsenal of subsystems that operate autonomically in machine learning mode or through task-specific algorithms, as seen in Fig. 10.4. All the components communicate through a smart nanogrid. DIE has nine subsystems headed by the central coordination center (CCC), which acts as the nervous center and control hub. Fig. 10.5 shows how the nine cognitive subsystems are interlinked and make up one holistic system. The following illustrative diagrams give the reader a clearer picture of the structure of DIE.
FIGURE 10.5
This 3D diagram of DIE shows all the components pathologically connected to one another. Here the term "pathologically" refers to cognitive connectivity, with AI service messages exchanged among the federated subsystems. AI, artificial intelligence; DIE, Digital Immunity Ecosystem.
Anatomical composition of Digital Immunity Ecosystem

Component 1: the central coordination center

The CCC is the nerve center, the brain, of DIE. The center has several technical responsibilities that must be met for it to guarantee adequate protection and safety of the critical systems. It is aware of all the security activities in the yard. It receives status data from all the subordinate systems and dispatches instructions and performatives based on the situation. DIE comes with a multiscreen information dashboard showing how the subsystems operate under normal conditions. The main objective of the main dashboard (MD), as shown in Fig. 10.6, is to maintain optimum levels of operation for all the subscribed infrastructure critical systems, catch all exceptions and abnormalities ahead of time, and, most importantly, dispatch rapid response alerts. The MD is to DIE what the instruments and controls of the cockpit are to a jumbo jet. DIE draws its predictive clout from its accumulative knowledge reservoirs. DIE may collaborate with a group of prominent antivirus technology (AVT) providers who will provide data on the latest
FIGURE 10.6
This is the CEWPS dashboard; it is the "brain" of the system, connecting and supervising all the different components and receiving real-time status, much like the brain with its sensory and motor neurons. The dashboard is also connected to the city grid, which links all the critical systems together. The highlighted buttons activate the DNA data storage. CEWPS, Cognitive Early Warning Predictive System.
cyberattacks in the world. Building a synergistic alliance with AVT providers gives DIE higher credibility and, more importantly, feeds DIE with a steady flow of fresh malware data ready to be converted into vaccine knowledge.
Component 2: the knowledge acquisition component

The purpose of the data distillation process is to collect all cybercrime cases, the whole attack cycle, and normalize them into the attack knowledge base (AKB) before DIE can perform predictions. The Causality Reasoning Engine (CRE) will be discussed under DIE Component 4. By analogy, we can compare the acquisition process to our sensory nervous system, which is made of cells that collect external sensory signals for distillation, filtration, and conversion into stimuli stored in the brain, before motor signals are instantiated and sent to the attack area. Fig. 10.7 shows the overall four stages of the system, including the three vital knowledge engines (the Cyber Attack Database, the Virus Knowledge Database, and the Vaccine Database) that go through a deep learning process before the system can generate credible predictions.
FIGURE 10.7
Causality is the relationship between causes and effects: an attack causes an emergent need for defense. An inference engine cycles through three sequential steps: match rules, select rules, and execute rules. The deep learning algorithm plays a crucial role in providing a well-educated estimate. The execution of the rules will often result in new facts or goals being added to the knowledge base, which will trigger the cycle to repeat. This cycle continues until no new rules can be matched.
Component 3: reverse engineering center

The center is responsible for decomposing all attacking unknown cyberviruses and learning everything about their code and technologies. The cyberpathology reports on captured and quarantined viruses will be cataloged and stored in the virus knowledge base (ViKB). Information coming from other forensics centers will also be stored in the ViKB. Meanwhile, the corresponding antivirus vaccines will be stored in the vaccine knowledge base (VaKB). The CCC will receive daily bulletins from the reverse engineering center. The Smart Vaccine Center (SVC) is responsible for generating the Smart Vaccine nanobots and sending them to the city nanogrid as a response to cyberattacks.

Tools of reverse engineering are categorized into disassemblers, debuggers, hex editors, and monitoring and decompiling tools:

1. Disassemblers: A disassembler is used to convert binary code into assembly code and also to extract strings, imported and exported functions, libraries, etc. Disassemblers convert the machine language into a user-friendly format. Different disassemblers specialize in certain things.
2. Debuggers: This tool expands the functionality of a disassembler by supporting views of the CPU registers, the hex dump of the program, the stack, etc. Using debuggers, programmers can set breakpoints and edit the assembly code at run time. Debuggers analyze the binary in a similar way to disassemblers but also allow the reverser to step through the code, running one line at a time, to investigate the results.

3. Hex editors: These editors allow the binary to be viewed and changed as per the requirements of the software. Different types of hex editors are available for different functions.

4. Portable executable and resource viewer: The binary code is designed to run on a Windows-based machine and carries very specific data that tell the system how to set up and initialize a program. All programs that run on Windows should have a portable executable header that lists the DLLs the program needs to borrow from.
Component 4: the causality reasoning and predictor The reasoning engine is the artificial intelligence (AI)-centric "smart guy" component of DIE, often referred to as a knowledge-based system. DIE refers to it as the early warning predictor: an inference (reasoning) engine that relies on deep learning and Bayesian network models to generate probabilistic attack forecasts. Weather forecasting is a similar application, predicting the state of the atmosphere for a given location and time. Weather forecasts are made by first collecting quantitative weather data from weather satellites about the current state of the atmosphere over a given place and using scientific understanding of atmospheric processes to project how the atmosphere will change. The second step is to enter these data into a mathematical weather model to generate credible prediction results. Etymologically, the word infer means to "carry forward"; an inference is a conclusion reached based on evidence and reasoning. We can say inference is made up of three serial processes: deduction, induction, and distinction. Here is a further explanation: First step, match the rules: The inference engine finds all of the rules that are triggered by the current contents of the knowledge base. In forward chaining, the engine looks for rules where the antecedent (left-hand side) matches some fact in the knowledge base. In backward chaining, the engine looks for antecedents that can satisfy one of the current goals. Second step, select the rules: The inference engine prioritizes the various rules that were matched to determine the order in which to execute them. Third step, execute the rules: The engine executes each matched rule in the order determined in step 2 and then iterates back to step 1. The cycle continues until no new rules are matched. The AKB includes all facts about the cyberattack episodes, the cybercriminal profiles, the victims, and the impact of the attack.
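The match, select, and execute steps above can be sketched as a tiny forward-chaining loop. This is an illustrative toy, assuming rules are simple (antecedents, consequent) pairs; the fact and rule names are hypothetical, not drawn from the actual AKB:

```python
def forward_chain(facts, rules):
    """Run the match -> select -> execute cycle until no new rules fire.

    Each rule is an (antecedents, consequent) pair: if every antecedent
    is already in the knowledge base, the consequent is asserted.
    """
    facts = set(facts)
    while True:
        # Step 1: match every rule whose antecedents are all satisfied
        # and whose consequent is not yet a known fact.
        matched = [r for r in rules if r[0] <= facts and r[1] not in facts]
        if not matched:
            return facts          # no new rules matched: the cycle ends
        # Step 2: select the order (here, simple list order).
        # Step 3: execute each matched rule, then iterate back to step 1.
        for antecedents, consequent in matched:
            facts.add(consequent)

# Hypothetical toy rules, not actual AKB content.
rules = [
    ({"port_scan", "payload_signature_X"}, "probable_attack"),
    ({"probable_attack"}, "issue_early_warning"),
]
print(forward_chain({"port_scan", "payload_signature_X"}, rules))
```

Note how the second rule fires only on the second pass, after the first pass has asserted "probable_attack"; this chaining is what lets a small rule base produce multi-step conclusions.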
The world was represented as classes, subclasses, and instances, and assertions were replaced by values of object instances. The rules worked by querying and asserting values of the objects. In addition to the input source of the AKB, ViKB and the VaKB are the other two significant knowledge bases that participate in the reasoning process. Fig. 10.7 illustrates the four interrelated phases of the intelligent process of the DIE.
Machine learning component
FIGURE 10.7 The knowledge data acquisition component is responsible for the distillation of raw data. The causality reasoning and early warning predictor are the heart of CEWPS/SV. They extract information from the AKB, the VaKB, and the ViKB before the system generates credible warnings.
Machine learning (ML) is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech
recognition, effective Web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress toward human-level AI. The reasoning and prediction engine of DIE relies heavily on ML and deep learning. In deep learning, the algorithm learns from past attacks using the vaccine knowledge engine and the virus knowledge engine and selects the right prediction with state-of-the-art accuracy, sometimes exceeding human-level performance.
Component 5: smart city critical infrastructures When we talk about smart cities, the critical infrastructures we are referring to have two levels of criticality: (1) criticality of the structure itself and (2) criticality of the system (software, hardware, and operating data) that manages the structure. That is the reason we included the critical infrastructure in the ecosystem. Figs. 10.8–10.11 supplement the panoramic view of intelligent cyberdefense, which includes the following: 1. The Smart Vaccine Grid is the cognitive defense system of all the critical systems in the city. 2. The City Smart Grid is the common grid that connects the critical systems. 3. The city coordination center and the smart vaccine coordination center are tightly connected and collaborate to monitor and respond to any attack on the city in time.
FIGURE 10.8 All the systems of the critical infrastructures are tied to the city smart grid. Whenever there is an attack, the smart vaccine fighters will respond, and minimal damage will be incurred. Some systems might be compromised by DDoS, but the city smart grid will take immediate action and notify and protect most of the inoculated systems. DDoS, distributed denial of service.
FIGURE 10.9 In the domain of terrorism, the term "critical infrastructure" has become of paramount importance. In fact, certain national infrastructures are so vital that their incapacity or destruction would have a debilitating impact on the defense or economic security of the country. The 11 critical systems that manage the infrastructure in a smart city are ranked by criticality. There are six criticality levels, and Power & Energy gets the award for being the most catastrophic in case of a city or regional blackout.
FIGURE 10.10 The critical systems and their information repositories are traced, by location and criticality, on the city smart grid and monitored by the city central coordination center. They are also connected to the binary data complex (in gray), indexed by location, application, and time, before they are stored on the DNA USB.
FIGURE 10.11 DIE by design comes with two smart screens. The top one is called the DIE Smart Nanogrid, which is the protective grid; the bottom one is the City Smart Nanogrid. When a hostile DDoS attack is aimed at the city grid, the early warning predictor has already notified the smart grid commander. The attack may be distributed all over the city, but the smart vaccine army will nullify it. Meanwhile, all the critical systems have been alerted and inoculated. DDoS, as the name implies, is a distributed attack and will try to compromise as many vulnerable systems as possible on the grid. DIE, Digital Immunity Ecosystem; DDoS, distributed denial of service.
Component 6: the smart vaccine center One of the great contributions to humanity was the discovery of the vaccine. Without adaptive immunity, one-fourth of the human race would have been terminally ill. With the exponential acceleration of technology, the digital world is like a runaway train with no control; the whole world needs quality of life, not just smart cities. Digital immunity will offer similar contributions to the digital world. The Smart Vaccine (SV) is one of the most fascinating services that technology could offer to win the battle against cybercrime and cyberterrorism. The SV is built with two avant-garde and innovative technologies: AI and nanotechnology (NT). The SV was conceived with futuristic features that will be ready in a decade. The SV has seven well-programmed components, as shown in Fig. 10.12, that work autonomically in any difficult battle situation against adversary nanobots, much like human B cells. NT also offers incredible advantages in the design of the Nano Smart Vaccine, which is the new generation of smart weaponry against cybermalware.
FIGURE 10.12 The Nano Smart Vaccine is the five-star general running a huge army that defeats any adversary attack.
Component 7: the vaccine knowledge base The VaKB is the intelligent "pharmacy" that holds the prescriptions of all the vaccines manufactured for previous attacks. It works very closely with the causal reasoning engine. Further explanation will be provided in the next section.
Component 8: the virus knowledge base The ViKB is the repository that contains all the attack payloads, attack descriptions, sources, and expected targets. It works very closely with the causal reasoning engine. Further explanation will be provided in the next section. The VaKB and ViKB are critically important to the overall security of the smart grid, as shown in Fig. 10.13. In situations where the virus is not available in the knowledge base and its matching vaccine is not ready, the infected system will provide samples of the virus, and a vaccine will promptly be fabricated for the rest of the systems on the grid.
FIGURE 10.13 The parallelism between virus reverse engineering (once a virus is caught for forensics) and vaccine processing is uncanny and fascinating. Using Bayesian network reasoning and visualization, CEWPS comes up with amazing predictions that allow the critical systems to be vaccinated before the attack spreads to the other critical systems. CEWPS, Cognitive Early Warning Predictive System.
Component 9: CEWPS smart nanogrid We are now leaving the fourth generation of NT and entering the fifth one. Let us not forget Ray Kurzweil's words in his famous book The Singularity Is Near: the story of the destiny of the human-machine civilization, a destiny we have come to refer to as the Singularity. We are going to achieve marvelous things, including nanosystems that will outperform our present computing hardware and software. Nano could replace the current technology, which sends data through metal lines, with metallic carbon nanotubes, which conduct electricity better than metal. The grid has a massive network of pipelines for messages and transport services, as shown in Fig. 10.14. When information is sent from one core to another, the outgoing electrical signal is converted to light and travels through a waveguide to another core, where a detector changes the data back to electrical signals. Nanowires thus represent the best-defined class of nanoscale building blocks, and this precise control over key variables has correspondingly enabled a wide range of devices and integration strategies to be pursued.
FIGURE 10.14 The secret of CEWPS' uniqueness is its AI nanogrid. No security system can hold a candle to CEWPS. The cross section of the Smart Nanogrid shows its advanced sophistication. The grid connects everything to everything. Like the nervous system, it receives command messages and responds with action messages. AI, artificial intelligence.
The smart grid model During the design of CEWPS, we looked at the risk issue and how it can be managed best, and we realized that the only way to be ahead of the malware curve is to stay away from conventional technologies and jump into the two leading domains of AI and NT. CEWPS and all its intelligent
components harness these two technologies to fabricate and assemble the "Digital Immunity" environment of the future. We wanted to create a replica of the city with a smart grid over it, with a huge variety of connections, different types of devices, sensors, and meters, and selected critical infrastructures. The DIE screen in Fig. 10.15 shows how all the critical infrastructure systems of the city of Dubai are mapped onto the smart grid with exact coordinates. All attacks will be displayed on the grid at their exact locations in real-time mode. The DIE CCC will be alerted, and Smart Vaccine Nanobots will be mobilized to contain the attack and carry captured adversary nanobots back to the ViKB.
FIGURE 10.15 DIE has a great feature that no other system has: a highly animated, radar-like, real-time graphical representation of the City Smart Nanogrid during an attack, showing how the Smart Vaccine Nanobots are neutralizing the attackers. DIE, Digital Immunity Ecosystem.
The DNA component Our digital universe is slipping deeper into an ocean of silicon technology, and we keep adding more and more digital devices to the equation: the new crop of video games and automobile GPS systems, programmable implants, the intelligent drones that provide enemy intelligence and "singularity" precision attacks, RFID chips for marathon runners, and Bluetooth sunglasses. We are building hyperscale data centers that consume more energy than middle-scale cities. And yet everyone is talking about green this and green that, and all this rubbish. Social media has jumped to the front burner and become an open stage for scandals and dirty linen. So far, everyone is enjoying the storage ride with no concern about shortage. Even gargantuan institutions such as Google, Amazon, Facebook, and the other big-league giants are selling everyone the cloud vibes, which put more dents in the Internet and telecom. Their gauge of success is how many billions of dollars the CEO makes.
DNA is a logical component of the digital immunity ecosystem because it offers several advantages over magnetic storage: it can help solve complex scientific problems, it has an immense capacity beyond limit, it is green energy, and it will last forever without damage. Data storage demand will also increase exponentially, and DNA will be the right platform to offer first-quality encryption and hyperscale capacity. We will be talking about DNA storage in the following chapters. Fig. 10.16 shows how DNA storage and retrieval works. In the next few years, DNA storage will be globalized and will bury the magnetic era with respect.
FIGURE 10.16 Consider DNA as another digital storage device. Instead of binary data being encoded as magnetic regions on a hard drive platter, strands of DNA of 96 bits are synthesized, with each of the DNA bases (TGAC) encoding a bit using the formula (T and G = 1, A and C = 0). To read the data stored in DNA, you slice the DNA string in the order of the binary sequence and convert each of the TGAC bases back into binary. Then the binary string gets converted into its native digital file. The eight steps outlined in the image show the storage and retrieval process.
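The caption's T and G = 1, A and C = 0 scheme can be sketched in a few lines. Note that the figure fixes only the base-to-bit mapping; the alternation between the two candidate bases for each bit is our own assumption (a common trick, since having two bases per bit lets the encoder avoid long single-base runs that are hard to synthesize):

```python
BIT_OF = {"T": "1", "G": "1", "A": "0", "C": "0"}   # the caption's mapping

def encode_bits(bits):
    """Write each bit as a base; alternate between the two candidate bases
    (our assumption) so no long single-base run is synthesized."""
    out = []
    for i, b in enumerate(bits):
        options = ("T", "G") if b == "1" else ("A", "C")
        out.append(options[i % 2])
    return "".join(out)

def decode_bases(strand):
    """Reading back: each base converts to its bit per the caption."""
    return "".join(BIT_OF[base] for base in strand)

strand = encode_bits("1011")
print(strand, "->", decode_bases(strand))   # TCTG -> 1011
```

The round trip always recovers the original bits, since both bases in each pair decode to the same bit.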
How did encryption get into DNA? It is fair to say that when an advertisement describes a septic tank as "the best invention since the wheel," we have begun to take our round, load-bearing companion for granted. Throughout history, most inventions were inspired by the natural world. The idea for the pitchfork and table fork came from forked sticks, the airplane from gliding birds. But the wheel is 100% Homo sapiens' innovation. As Michael LaBarbera, a professor of biology and anatomy at the University of Chicago, wrote in a 1983 issue of The American Naturalist, "only bacterial flagella, dung beetles and tumbleweeds come close" and are "wheeled organisms" in the loosest use of the term, since they use rolling as a form of locomotion. The first wheels were not used for transportation. Evidence indicates they were created to
serve as potter's wheels around 3500 BC in Mesopotamia, 300 years before someone figured out to use them for chariots. See https://evbio.uchicago.edu/site/news/michael_labarbera_uchicago_magazine/ for more information about Michael LaBarbera.
Cryptology evolution over time The miracle of cryptology in DNA Without cryptology, the world could have been laid flat by carnage. The history of encryption is the history of "the contest of wits" between encryption developers and encryption code breakers. Each time a new encryption algorithm has been created, it has been decrypted, which in turn has led to the creation of a new encryption algorithm, and cycles of algorithm creation and decryption have been repeated to this day. Hieroglyphics (pictograms used in ancient Egypt) inscribed on a stele in about 3000 BC are considered the oldest surviving example of encryption. Hieroglyphics were long considered impossible to ever read, but the discovery and study of the Rosetta Stone in the 19th century was the catalyst that made it possible to read them. Some clay tablets from Mesopotamia, somewhat later, were clearly meant to protect information; one, dated near 1500 BCE, was found to encrypt a craftsman's recipe for pottery glaze, presumably commercially valuable. Building on the work of Polish cryptologists, Bletchley Park, Britain's main decryption establishment during World War II, was set on decrypting the Enigma machine, a series of related electromechanical rotor cipher machines used by the Nazi military. It was considered unbreakable, as the Nazis changed the cipher every day. The Bletchley Park team, which included the father of modern computing, Alan Turing, capitalized on the machine's one fundamental flaw: no letter could be encrypted as itself. Armed with this information and Turing's Bombe machine, which greatly reduced the time required to crack Enigma, the Allied forces soon knew the Wehrmacht's every move. So, in summary: in an encryption scheme, the intended information or message, referred to as plaintext, is encrypted using an encryption algorithm (a cipher), generating ciphertext that can be read only if decrypted.
DNA cryptology DNA cryptography will be the information carrier that uses modern biotechnology as a means to transform plaintext into ciphertext. Biotechnology plays an important role in the field of DNA cryptography. DNA immersed with cryptologic technology is as impenetrable as the Mont Saint-Michel Abbey. We will also cover some of the biotechnology and software of the DNA field. DNA encryption is an ingenious biological technique for securing text and images because of its parallelism, vast storage, and fast computing quality. The cryptology process starts with a module of synthetic DNA whose base pairs will be used in encryption/decryption, and an algorithm is formulated using the bases A (adenine), C (cytosine), T (thymine), and G (guanine). The ACTG characters form a DNA sequence (S), which is merged with the message (M) to produce a new sequence (S′); S′ is sent to its receiving destination, where it is converted back into S.
The encryption algorithm The following is the sophisticated encryption algorithm method that uses DNA coding to increase encryption strength, as shown in Fig. 10.17: Step 1: We select the binary file of the critical system to be encrypted. The file is in binary format (0s and 1s). Step 2: The binary data are grouped into four blocks and encrypted using the traditional Data Encryption Standard (DES), a symmetric-key algorithm. Step 3: Then, we convert the encrypted file to binary format. Step 4: Then, we group the binary-encrypted file into two-bit blocks and convert it into DNA code: A for 00, T for 01, G for 10, and C for 11. Step 5: We then add a primer/stopper on either side of this message. Primers will act as stoppers and detectors for the message. This must be done before sending the encrypted message across the grid. Step 6: Then, we add to the original encrypted file an additional DNA sequence prepared by the sending party, followed by another primer/stopper. Step 7: We add several additional DNA sequences on both sides of the original encrypted file, or we confine the message to a microdot in the microarray (a secure holding place). Step 8: The final compounded sequence is sent to the destination across the grid. DNA data storage (DDS) is already encrypted in the sense of being a jumble of ACTG; however, hackers can use sequencers to convert the coded messages to plaintext. Data encryption therefore becomes necessary to protect the DIE biodata.
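Steps 4 and 5 can be sketched directly. The 2-bit mapping (A = 00, T = 01, G = 10, C = 11) comes from the algorithm above; the primer sequence here is a hypothetical placeholder, and the "encrypted" bit string merely stands in for the DES output:

```python
TO_DNA = {"00": "A", "01": "T", "10": "G", "11": "C"}   # step 4 mapping
FROM_DNA = {v: k for k, v in TO_DNA.items()}
PRIMER = "ATCG"   # hypothetical primer/stopper sequence, for illustration only

def binary_to_dna(bits):
    """Step 4: group the encrypted binary into 2-bit blocks and map to bases."""
    assert len(bits) % 2 == 0
    return "".join(TO_DNA[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_binary(strand):
    """Reverse of step 4, used by the receiver after stripping primers."""
    return "".join(FROM_DNA[base] for base in strand)

def add_primers(payload):
    """Step 5: flank the message with primer/stoppers before sending."""
    return PRIMER + payload + PRIMER

encrypted_bits = "0001101100011011"   # stand-in for the DES-encrypted file
print(add_primers(binary_to_dna(encrypted_bits)))   # prints ATCGATGCATGCATCG
```

Steps 6 and 7 then append further decoy sequences around this payload; they are omitted here because their exact construction is not specified.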
Here is another efficient approach to implementing cryptography in DNA, for the tightest security, impossible to crack: for cryptographic purposes, the interest is to generate a very long one-time pad (OTP), see Appendix 10.A, as a cryptographic key, which ensures the unbreakability of a cryptosystem; to convert the classical cryptographic algorithms to DNA format to benefit from the advantages that DNA offers; and to find new algorithms for biomolecular computation (BMC), such as DNA XOR, an XOR gate that implements the exclusive OR.
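A DNA XOR can be sketched by treating each base as a 2-bit value and XORing message against pad, base by base. The base-to-value assignment below reuses the A/T/G/C = 00/01/10/11 code from the algorithm above and is our assumption; the pad is an illustrative OTP of the same length:

```python
VAL = {"A": 0, "T": 1, "G": 2, "C": 3}    # assumed 2-bit values (A=00 ... C=11)
BASE = {v: k for k, v in VAL.items()}

def dna_xor(strand, pad):
    """XOR message and pad base by base over their 2-bit values."""
    return "".join(BASE[VAL[m] ^ VAL[p]] for m, p in zip(strand, pad))

message = "ATGCATGC"
pad = "GGTACCAT"               # illustrative one-time pad: same length, never reused
cipher = dna_xor(message, pad)
print(cipher)                  # GCCCCGGG
print(dna_xor(cipher, pad))    # XOR with the same pad recovers ATGCATGC
```

Because XOR is its own inverse, the same pad both encrypts and decrypts, which is exactly the OTP property the section describes.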
FIGURE 10.17 This diagram represents the advantage of storing the critical system operation data on DNA in encrypted format, with two rounds of encryption and storing the file in separate blocks. Furthermore, some “dummy DNA” sequences are tagged to the original DNA file. It makes hacking extremely challenging, if not impossible. Copyright [MERIT CyberSecurity Group]; All rights are reserved.
Malware going after DNA storage DNA computing DNA computing is a subarea of molecular computing and an exciting, fast-developing interdisciplinary area. DNA computing is a novel and fascinating development at the interface of computer science and molecular biology. It has emerged in recent years not simply as an exciting technology for information processing but also as a catalyst for knowledge transfer between information processing, NT, and biology. This area of research has the potential to change our understanding of the theory and practice of computing. The DIE is the new paradigm that will replace AVT, which has stalled on top of its plateau with no further advancement. DIE is complemented with a formidable component that will also change the computing world; just give it a few years. Fig. 10.18 shows the city of Dubai equipped with an intelligent grid, where all the operating critical systems' data will be converted into DNA storage banks. DDS is a subset of DNA computing. The computing world today is getting close to saturation in execution speed and data storage. Dr. Adleman is our Thomas Edison of the 21st century.
FIGURE 10.18 AI technology and DNA computing were bonded together to engineer this futuristic cybersecurity ecosystem. The city’s critical systems are securely running and storing generated information in conventional storage. The City Smart Grid and the Smart Vaccine Grid keep these systems operating without downtime. Harnessing DNA for digital storage is revolutionary. The digital universe is running out the magnetic storage. DNA offers significant advantages, which will be used in DIE. AI, artificial intelligence; DIE, Digital Immunity Ecosystem.
Operations on DNA are massively parallel: A test tube of DNA can contain trillions of strands. Each operation on a test tube of DNA is carried out on all strands in the tube in parallel. Think of enzymes as hardware, DNA as software. The nascent field of DNA computing got started in 1994 with
an article in Science magazine by a leading computer theorist, Dr. Leonard Adleman of the University of Southern California in Los Angeles. Dr. Adleman explained how he encoded a problem by synthesizing (combining) DNA molecules with a chosen sequence and solved it by letting the DNA molecules react in a test tube, producing a molecule whose sequence is the answer. Dr. Adleman solved an instance of the traveling salesman problem for seven cities within a second, using DNA molecules in a standard reaction tube. However, the most advanced supercomputers would take years to calculate the optimal route for 50 cities. Adleman dubbed his DNA computer the TT-100, for test tube filled with 100 μL, or about 1/50th of a teaspoon of fluid, which is all it took for the reactions to occur.
DNA computing applications A gram of DNA contains about 10¹² terabytes! DNA will be strategically useful for storing the operating data of DIE critical systems. Progress in the development of molecular computers may lead to a "doctor in a cell": a biomolecular computer that operates inside a living organism, for example, the human body, programmed with medical information to diagnose potential diseases and produce the required drugs on site. This will ultimately lead to a device capable of processing DNA inside the human body, finding abnormalities, and creating healing drugs. Fig. 10.19 shows the complete DNA computing tree and all the different advances in modern genetic medicine.
FIGURE 10.19 The area of DNA computing holds great potential to be explored due to its applications in many various fields. DNA computing operates in natural noisy environments, such as a glass of water. Along with all the remarkable new applications, we are going to focus on Smart City Critical Systems Data (highlighted in the dashed area).
DNA computer We mentioned earlier that DDS will be a great value-added component for DIE. To understand DDS, we need to understand how a DNA computer does its magic. The DNA computer is an integral part of DNA
computing, which is an interesting concept that utilizes the biological structure of DNA to design a new type of computing. The concept of DNA computing was introduced in 1994 by the USC professor Leonard Adleman, who showed that DNA could be used to store data and even perform computations in a massively parallel fashion. Using the four bases (blocks) of DNA (adenine, thymine, cytosine, and guanine), Adleman encoded the classic problem known as the traveling salesman problem into strands of DNA and utilized biological properties of DNA to find the answer. Performing millions of operations simultaneously allows the performance rate of DNA strands to increase exponentially. Adleman's experiment was executed at 10¹⁴ operations per second, a rate of 100 teraflops (100 trillion floating point operations per second). The world's fastest supercomputer, by comparison, ran at just 35.8 teraflops. Traditional storage media, such as videotapes, require 10¹² cubic nanometers of space to store a single bit of information; DNA molecules require just one cubic nanometer per bit. In other words, a single cubic centimeter of DNA holds more information than a trillion CDs. This is because the data density of DNA molecules approaches 18 million bits per inch, whereas today's computer hard drives can only store less than 1/100,000 of this information in the same amount of space.
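Taking the volume figures above at face value, a quick back-of-envelope check of the density claim (one cubic nanometer per bit for DNA versus 10¹² for traditional media):

```python
# Back-of-envelope check of the density figures quoted above.
tape_nm3_per_bit = 1e12          # traditional media: ~10^12 nm^3 per bit
dna_nm3_per_bit = 1.0            # DNA: ~1 nm^3 per bit
density_gain = tape_nm3_per_bit / dna_nm3_per_bit

nm3_per_cm3 = (1e7) ** 3         # 1 cm = 10^7 nm, so 1 cm^3 = 10^21 nm^3
dna_bits_per_cm3 = nm3_per_cm3 / dna_nm3_per_bit

print(f"DNA is {density_gain:.0e}x denser than tape")
print(f"~{dna_bits_per_cm3:.0e} bits fit in one cm^3 of DNA")
```

At one bit per cubic nanometer, a cubic centimeter holds on the order of 10²¹ bits, which is the scale the chapter is pointing at.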
Case study: demographic and data storage growth of Dubai Racing to be number one in demographic growth, Dubai is an astonishing metropolis. The city does not have any significant history; neither the Romans, the Egyptians, nor the Crusaders ever reached this part of the land. Dubai is famous for sightseeing attractions such as the Burj Khalifa (the world's tallest building) and shopping malls that come complete with mammoth aquariums and indoor ski slopes. Now Dubai wants to be the hub of technology and innovation. The DIE/DDS futuristic engineering is to be exhibited at Expo 2020. If we look at the timeline of the city since its birth, we realize there is no city in the world that has reached the excellence of Dubai's singularity in 47 years. On World Population Day, Dubai's population surpassed 3 million people, one-third of the UAE's 9.3 million. And by 2027, Dubai is expected to jump to 7 million people. The meteoric growth of people brings several criticalities that will raise eyebrows: Criticality polynomial = (population growth) × (infrastructure complexity) × (growth of critical systems) × (digital storage growth) × (system security). If we wanted to tie risk to criticality, we have the following formula: System criticality = failure frequency per year × damage ($) = risk (damage per year)
Dubai digital data forecast We are going to focus on storage growth and security risk because we can assess the criticality with numbers and charts.
FIGURE 10.20 Dubai is used for our demographic model. Data were provided by the Dubai Chamber of Commerce, the central authority that administers the city's operational data.
Dubai is growing in an asymptotic direction. When we look at Dubai's urban performance, we are impressed with the progress and future planning. Fig. 10.20 shows the city's operational data for 2017. According to the Dubai Chamber of Commerce, which we can consider the "Ministry of Information" in Dubai, we took the following interesting figures for 2017:
• Grand total of companies: 43,789 companies (issued certificates in 2017)
• Average servers: 55,000 servers
• Estimated Internet of Things (IoT) devices connected to private servers: 15 million devices
• Estimated number of subscribed homes: 23,000
• Estimated daily information storage (10% of user data from the 15 million devices are stored): 1.5 million files at 30 gigabytes per file = 45 × 10¹⁵ bytes per day
• Dubai infrastructure critical systems (grouped in 13 categories as defined by DIE): 1000 systems
• Each critical system stores 100 gigabytes of critical production data per day. These data are highly critical and at high risk of cyberattacks.
• Dubai business growth: the portal www.dubai.ae shows how solid and steadily growing the city's economy is; it lists the nine remarkable key advantages of the city.
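The daily-storage estimate above can be reproduced arithmetically (using the quoted 2017 figures; the 10% share is modeled here as one 30-GB file per contributing device):

```python
# Reproducing the daily-storage estimate from the 2017 figures above.
devices = 15_000_000                     # estimated IoT devices
files_per_day = devices // 10            # 10% contribute a stored file: 1.5 million
bytes_per_file = 30 * 10**9              # 30 gigabytes per file
device_bytes = files_per_day * bytes_per_file

critical_systems = 1000                  # DIE-defined critical systems
critical_bytes = critical_systems * 100 * 10**9   # 100 GB per system per day

print(f"devices: {device_bytes:.2e} bytes/day")            # 4.50e+16 = 45 x 10^15
print(f"critical systems: {critical_bytes:.2e} bytes/day")
```

The device-generated 45 × 10¹⁵ bytes per day dwarfs the critical systems' 10¹⁴ bytes per day, which is why the storage forecast is dominated by consumer data.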
Unless we seriously address the storage requirements for a dynamic smart city like Dubai, there will be a major shortage of storage, which will handicap all the systems in the city, as illustrated in Fig. 10.21.
FIGURE 10.21 Statistics shows how population, critical systems, Internet of Things, and storage requirements are growing and are expected to grow by 2027.
Fig. 10.22 clearly shows how the demographic increase will impact the utilization of the critical systems, and consequently, the storage requirements will go asymptotic. Storage scalability reaches zero, whereas storage criticality will be exponential.
Appendices Appendix 10.A Cryptography: one-time pad In cryptography, the one-time pad (OTP) is an encryption technique that cannot be cracked but requires the use of a one-time preshared key the same size as, or longer than, the message being sent. In this technique, a plaintext is paired with a random secret key (also referred to as an OTP). Then, each bit or character of the plaintext is encrypted by combining it with the corresponding bit or character from the pad using modular addition. The resulting ciphertext will be impossible to decrypt or break if the key is
1. truly random,
2. at least as long as the plaintext,
FIGURE 10.22 This graph shows how the population of Dubai will impact the complexity index of the city. Something must be done before 2024. The S-curve is the progress of DNA storage when backup and archiving migrate to DNA.
3. never reused in whole or in part, and
4. kept completely secret.
It has also been proven that any cipher with the perfect secrecy property must use keys with effectively the same requirements as OTP keys. Digital versions of OTP ciphers have been used by nations for some critical diplomatic and military communication, but the problems of secure key distribution have made them impractical for most applications.
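The OTP scheme in this appendix can be sketched in a few lines; XOR serves as the per-bit modular addition, and the message text is illustrative:

```python
import secrets

def otp_encrypt(plaintext, pad):
    """Combine each byte with its pad byte; XOR is bitwise mod-2 addition."""
    assert len(pad) >= len(plaintext), "pad must be at least as long as the message"
    return bytes(p ^ k for p, k in zip(plaintext, pad))

message = b"inoculate grid sector 7"
pad = secrets.token_bytes(len(message))   # truly random, used once, kept secret
cipher = otp_encrypt(message, pad)

# Decryption is the same operation with the same pad.
assert otp_encrypt(cipher, pad) == message
print(cipher.hex())
```

Note how the four key requirements map onto the code: `secrets.token_bytes` supplies true randomness, the length assertion enforces requirement 2, and requirements 3 and 4 are operational disciplines the code cannot enforce, which is exactly the key distribution problem mentioned above.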
Appendix 10.B DNA cryptography with data encryption standards DNA computing nowadays serves as a basis for some of the most important cryptosystems because of its exclusive parallelism and massive data density. Most cryptosystems are based on mathematical equations that represent crypto-keys; the harder these mathematical equations, the harder it is for an attacker to gain access to the system. There are many cryptographic systems in use nowadays. One of the best-known is the Data Encryption Standard (DES), created by IBM in 1976, which produces 64-bit ciphertext from 64-bit plaintext using a 56-bit secret key. However, the parallel processing of DNA computing is more powerful than DES: using a secret key and an efficient DNA method, NIST was able to break a DES key within a day. The same goes for the RSA cryptographic method, which has a reliability problem, as it is based on factoring large numbers, and factoring them breaks the key.
OTP is one of the algorithms that are still secure and reliable even when compared to DNA computing. OTP is a private-key encryption algorithm that is completely secure theoretically, but practically it has some challenges in the generation and distribution of keys. The idea of OTP is that the encryption of each bit or character in the plaintext is done by a modular addition, which means that each bit or character is combined with a bit or character from a random key generator. This randomness makes it impossible, in theory, to break the ciphertext. To have this kind of security and reliability, each secret key should not be used more than once in the algorithm, which is considered one of the main disadvantages of using OTP. Another disadvantage is that each pad sequence should be truly random and unpredictable, and the length of this pad sequence should be at least equal to the number of bits in the message. Another DNA encryption algorithm has been developed using chaos theory. The developed algorithm works for both text and image encryption and adds some advanced features to improve the OTP algorithm, which is considered a reliable data encryption algorithm. Concerning text encryption, the developed algorithm introduces a chaotic selection between either the DNA strands or the OTP DNA strands. With the advancement of technology, traditional cryptography systems could not achieve the level of security required to face advanced attackers, so DNA became one of the solutions to this problem because of its ability to contain larger information sets than any other traditional technique. The DNA encryption algorithm provides more security, scalability, and robustness than traditional cryptosystems because of the two stages of encryption used. The ciphertext is generated by encrypting the plaintext in two stages. The first stage uses a randomly generated key, which works as an OTP key.
This way, the attacker faces the problem of predicting the random numbers used for key generation, which makes the system more secure. The second stage generates a DNA-sequence symmetric key (using a binary DNA coding scheme), which is used in decryption as well. The two-stage encrypted text and the ciphertext are sent over two separate secure channels. The ciphertext is also transferred as a DNA sequence to generate the final ciphertext. In the developed algorithm, DNA is used as a data carrier, which makes the system more secure: the biological complexity of the DNA sequence makes it very difficult for the attacker to guess, or gain access to, the symmetric key. Despite being more robust, secure, and scalable, one of the main challenges of this algorithm is the computational complexity it may introduce.
DNA cryptography has been developed, studied, and explored extensively in recent years. Most of the studies and research on DNA cryptography center on DNA sequences used for binary data encoding. Despite the complexity of DNA cryptography and its early stage of development, it is expected that using DNA in cryptography will be a great advance for information security. Such a cryptography system exploits the advantages of the classical cryptography systems while reducing their risks and restrictions. The randomness of the generated DNA sequences brings many advances and conveniences to the science of cryptography. For example, there is no need to send long keys; instead, a key can be a DNA sequence that has a unique identification number.
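A minimal sketch of the OTP-style first stage described above, in Python. The function names are ours, and `secrets.token_bytes` stands in for the chapter's random key generator; the DNA-carrier second stage and the two separate channels are not modeled here.

```python
import secrets

def otp_encrypt(plaintext: bytes):
    """One-time pad: a truly random key, as long as the message,
    used exactly once (the OTP requirements described above)."""
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key

def otp_decrypt(ciphertext: bytes, key: bytes) -> bytes:
    # XOR (bitwise modular addition) is its own inverse.
    return bytes(c ^ k for c, k in zip(ciphertext, key))

message = b"DNA cryptography"
ciphertext, key = otp_encrypt(message)
assert otp_decrypt(ciphertext, key) == message
```

Note that reusing `key` for a second message would break the scheme, which is exactly the key-distribution disadvantage noted above.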
Appendix 10.C DNA cryptography with the advanced encryption standard
The National Institute of Standards and Technology (NIST), a US government agency in the Department of Commerce, needed an alternative to the data encryption standard (DES), which had become prone to attacks because of advances in the processing power of systems. The selected algorithm, Rijndael, was designed by the Belgian cryptographers Vincent Rijmen and Joan Daemen. Later, in the year 2000, the algorithm was formally adopted under the name advanced encryption standard (AES). In AES-128, there are 10 rounds of operations, requiring 10 different sets of keys that are generated using a process called
320
Chapter 10 Fusing DNA with digital immunity ecosystem
key expansion. The round keys required for the encryption process are generated using a true random number generator (TRNG) unit. For further advancement, DNA encoding is used along with the TRNG design to produce the complete 128-bit key required.
Appendix 10.D Advanced encryption standard algorithm
The advanced encryption standard (AES) uses a symmetric key, meaning the same key is used for encryption and decryption. In AES, the input block is fixed at 128 bits, whereas the key size varies among 128, 192, and 256 bits. We write AES-128 for the algorithm with a 128-bit key, and similarly AES-192 and AES-256. In AES, the 128 bits of input are grouped into 16 bytes per block, arranged as a matrix of four rows and four columns. AES does its calculations on bytes instead of bits. For encryption, the number of iterations is not fixed as in DES but depends on the size of the key used: AES uses 10 rounds for a 128-bit key, 12 rounds for a 192-bit key, and 14 rounds for a 256-bit key. Each encryption round uses a new key, called a round key, derived from the cipher key by the key expansion process. Decryption is simply the inverse of the encryption process: all the steps done for encryption are performed in reverse order to decrypt the cipher data. The final round is decrypted first, and the rest of the blocks are decrypted in reverse order. In the AES algorithm, the first operation is to XOR the input plaintext block with the given cipher key. Each of the subsequent rounds (nine full rounds for 128-bit encryption) has four stages, but the final round has only three. The same applies to decryption. The four stages of operation are as follows:
1. Substitute bytes
2. Shift rows
3. Mix columns
4. Add round key
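Two of the four stages are simple enough to sketch directly in Python on a 4x4 state matrix; this is an illustration, not the full cipher. SubBytes and MixColumns are omitted here because they require the S-box table and GF(2^8) arithmetic. The round counts per key size quoted above are also captured.

```python
# Rounds per key size, as described above.
ROUNDS = {128: 10, 192: 12, 256: 14}

def shift_rows(state):
    """ShiftRows: row r of the 4x4 AES state rotates left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def add_round_key(state, round_key):
    """AddRoundKey: XOR each state byte with the matching round-key byte."""
    return [[s ^ k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key)]

state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
shifted = shift_rows(state)
assert shifted[1] == [5, 6, 7, 4]                 # row 1 rotated left by 1
assert add_round_key(shifted, shifted) == [[0] * 4] * 4  # x XOR x = 0
assert ROUNDS[128] == 10
```

XOR being self-inverse is why AddRoundKey appears unchanged in decryption, while ShiftRows is undone by rotating in the opposite direction.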
Appendix 10.E DNA encoding
DNA cryptography is one of the fastest-improving innovations, building on the ideas of DNA processing. A new procedure for securing data, called DNA computing, has been demonstrated using the cellular structure of DNA. DNA can be used to store and transmit information, and the idea of using DNA computation in cryptography has been identified as a conceivable innovation that may offer new hope for unbreakable algorithms. DNA strands are long polymers of millions of connected nucleotides (DNA building blocks). Each nucleotide comprises one of four nitrogen bases, a five-carbon sugar, and a phosphate group. The nucleotides that make up these polymers are named after the nitrogen base they contain: A, C, G, T (adenine, cytosine, guanine, and thymine). Speed, compact storage, and minimal power requirements are among the advantages of DNA encoding. In DNA coding, the input information to be encoded arrives as characters. The encoding unit takes this input data and generates a triplet code, a combination of three DNA bases. The AES design with true random number generation (TRNG) is used to generate the round keys required for the encryption and decryption process. Also, the DNA-based TRNG
module is implemented to increase the security level further. The implementation is based on mathematical properties of the Rijndael algorithm, where the key expansion part relies on the DNA coder and the TRNG. The encryption design (shift rows, mix columns, add round key) and the decryption design are complete. The genetic-algorithm-based encoding for key generation is used for the encryption and decryption processes. The new design permits efficient area and speed characteristics while keeping a very high protection level. We conducted the relevant AES implementations with the DNA TRNG key generation method. With this novel approach to key generation, the purely random keys cannot be guessed, and key entry is no longer a manual process. All the required sets of 128-bit keys are generated randomly, and the level of security against various attacks is increased. Also, from the comparison table (with and without TRNG), we can see that the area and the delay, compared with the conventional AES implementation, have been reduced. The design has enormous scope for improvement; the TRNG block can be implemented using various other methods. A true source of randomness can be added, and a variety of true sources of randomness are available, which may help in further optimizing the design.
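The binary DNA coding mentioned throughout these appendices can be sketched with the usual 2-bits-per-base convention. The chapter does not fix an exact mapping, so the assignment A=00, C=01, G=10, T=11 below is an assumption for illustration.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, 2 bits per base."""
    strand = []
    for byte in data:
        for shift in (6, 4, 2, 0):           # four 2-bit groups, MSB first
            strand.append(BASES[(byte >> shift) & 0b11])
    return "".join(strand)

def dna_to_bytes(strand: str) -> bytes:
    """Invert the encoding: every four bases become one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

assert bytes_to_dna(b"\x1b") == "ACGT"       # 0x1b = 00 01 10 11
assert dna_to_bytes(bytes_to_dna(b"key")) == b"key"
```

This 2-bit scheme gives the 2-bits-per-nucleotide density quoted later in Chapter 11; real encodings add constraints (e.g., avoiding long base repeats) and error correction on top of it.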
Appendix 10.F Cryptology glossary (from the MERIT CyberSecurity library)
A-label: The ASCII-compatible encoded (ACE) representation of an internationalized (Unicode) domain name. A-labels begin with the prefix xn--. To create an A-label from a Unicode domain string, use a library like idna.
Asymmetric cryptography: Cryptographic operations where encryption and decryption use different keys: there are separate encryption and decryption keys. Typically, encryption is performed with a public key, and the result can then be decrypted with a private key. Asymmetric cryptography can also be used to create signatures, which are generated with a private key and verified with a public key.
Authentication: The process of verifying that a message was created by a specific individual (or program). Like encryption, authentication can be either symmetric or asymmetric. Authentication is necessary for effective encryption.
Bits: A bit is a binary value, a value that has only two possible states. Typically, binary values are represented visually as 0 or 1, but remember that their actual value is not a printable character. A byte on modern computers is 8 bits and represents 256 possible values. In cryptographic applications, when something is said to require a 128-bit key, you can calculate the number of bytes by dividing by 8: 128 / 8 = 16, so a 128-bit key is a 16-byte key.
Bytes-like: A bytes-like object contains binary data and supports the buffer protocol. This includes bytes, bytearray, and memoryview objects.
Ciphertext: The encoded data; it is not user readable. Potential attackers are able to see this.
Ciphertext indistinguishability: A property of encryption systems whereby two encrypted messages are not distinguishable without knowing the encryption key. This is considered a basic, necessary property of a working encryption system.
Decryption: The process of converting ciphertext to plaintext.
Encryption: The process of converting plaintext to ciphertext.
Key: Secret data are encoded with a function using this key. Sometimes multiple keys are used. Keys must be kept secret; if a key is exposed to an attacker, any data encrypted with it will be exposed.
Nonce: A number used once. Nonces are used in many cryptographic protocols. Generally, a nonce does not have to be secret or unpredictable, but it must be unique. A nonce is often a random or pseudorandom number (refer to random number generation on the web). Since a nonce does not have to be unpredictable, it can also take the form of a counter.
Opaque key: A type of key that allows you to perform cryptographic operations such as encryption, decryption, signing, and verification but does not allow access to the key itself. Typically, an opaque key is loaded from a hardware security module (HSM).
Plaintext: User-readable data you care about.
Private key: One of two keys involved in public key cryptography. It can be used to decrypt messages that were encrypted with the corresponding public key, as well as to create signatures, which can be verified with the corresponding public key. Private keys must be kept secret; if they are exposed, all encrypted messages are compromised, and an attacker will be able to forge signatures.
Public key: One of two keys involved in public key cryptography. It can be used to encrypt messages for someone possessing the corresponding private key and to verify signatures created with the corresponding private key. It can be distributed publicly, hence the name.
Symmetric cryptography: Cryptographic operations where encryption and decryption use the same key.
Text: This type corresponds to unicode on Python 2 and str on Python 3; it is equivalent to six.text_type.
U-label: The presentational Unicode form of an internationalized domain name. U-labels use Unicode characters outside the ASCII range and are encoded as A-labels when stored in certificates.
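The Authentication entry above (symmetric case) can be illustrated with Python's standard hmac module. The key and messages below are made up for the example.

```python
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)          # shared secret (symmetric case)
message = b"meter reading: 42 kWh"

# The sender attaches a tag; the receiver recomputes it and compares
# in constant time to avoid timing side channels.
tag = hmac.new(key, message, hashlib.sha256).digest()
assert hmac.compare_digest(
    tag, hmac.new(key, message, hashlib.sha256).digest())

# A tampered message fails verification.
forged = b"meter reading: 99 kWh"
assert not hmac.compare_digest(
    tag, hmac.new(key, forged, hashlib.sha256).digest())
```

This shows why, as the glossary notes, authentication is necessary alongside encryption: the ciphertext alone does not reveal whether an attacker altered it in transit.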
Suggested readings
An efficient VLSI design of AES cryptography based on DNA TRNG design: https://www.scribd.com/document/358965603/An-Efficient-VLSI-Design-of-AES-Cryptography-Based-on-DNATRNG-Design
DNA computing: www.researchgate.net/publication/303298036_DNA_Computing_Made_Simple/
DNA computing: http://www.authorstream.com/Presentation/vire7-2062598-dna-computing/
DNA computing on a chip: https://www.researchgate.net
DNA cryptography: https://www.slideshare.net/mayukhmaitra/dna-cryptography
DNA with silicon: https://www.researchgate.net/publication/320182637_PROPOSED_ARCHITECTURE_FOR_COMPARISON_OF_DNA_COMPUTING_WITH_SILICON_COMPUTATION
Termanini, R., 2016. The Cognitive Early Warning Predictive System Using the Smart Vaccine. CRC Press.
Termanini, R., 2018. The Nano Age of Digital Immunity Infrastructures and Applications. CRC Press.
Encryption in DNA: https://www.slideshare.net/Abhishektit/a-new-approach-towards-information-securitybased-on-dna-cryptography/
Secure data transmission with DNA: http://www.authorstream.com/Presentation/vire7-2062598-dna-computing/
Siege of Caffa: https://contagions.wordpress.com/2012/06/28/plague-at-the-siege-of-caffa-1346/
Smolinski, T.G., Milanova, M.G., Hassanien, A.-E., 2008. Computational Intelligence in Biomedicine and Bioinformatics, first ed. Springer. Corr. 2nd printing edition (April 6, 2011).
CHAPTER 11
DNA storage heading for smart city
The best way to predict the future is to design it.
– Peter Drucker
The United Nations estimates that by 2050, 66% of the world's population will be living in cities, up from the 54% in urban areas today. This migration has already been realized in a major way in such cities as Tokyo with 38 million people and Delhi and Shanghai both with more than 25 million people.
– United Nations forecast
The greatest single achievement of nature to date was surely the invention of the molecular DNA.
– Lewis Thomas
I'm fascinated by the idea that genetics is digital. A gene is a long sequence of coded letters, like computer information. Modern biology is becoming very much a branch of information technology.
– Richard Dawkins
I know that people keep saying that, but how can that possibly be true? The number of humans is growing only very slowly. The number of digitally connected humans, no matter how you measure it, is growing rapidly. A larger and larger fraction of the world's population is getting electronic communicators and leapfrogging our primitive phone-wiring system by hooking up to the Internet wirelessly, so the digital divide is rapidly diminishing, not growing.
– Ray Kurzweil
Introduction
We would like to clarify the difference between smart city DNA and DNA data storage. The former term describes the information that circulates in the city between the city administration and the citizens; the quality of a city is characterized by its digital infrastructure, its real estate infrastructure, the safety of its citizens, and its energy distribution. The latter term refers to binary data storage in DNA. By 2025, it is expected that 163 zettabytes (10²¹ bytes) of new data will be generated, according to a report by IT analyst firm IDC. That estimate represents a potential 10-fold increase over the 16 ZB produced through 2016. Smart cities will be hit hard by this gluttonous consumption of data: the Internet of things (IoT) will devour 85% of the computing resources of the city datacenter. The exponential growth of city operating data will create a substantial burden on the information
communication technology (ICT) infrastructure. DNA data storage design and deployment should be an important component of the design or renovation of a smart city. If big, infinite data is the key to cities evolving into smart cities, then a question arises as to the hierarchy of data prioritization; in other words, where does a city start? Two points of entry can help a city answer this question. The first is how some cities see the market driving the need for access to certain types of data. Incident reporting, energy usage and analysis, and transportation information are all areas in which citizens see immediate value. Other cities position new data-centric tools, such as social media, to communicate better with their citizens. This reactive approach is highly effective when implemented correctly, with many examples from all over the world serving as best practices and, in certain cases, lessons learned. The second point of entry is the proactive approach of identifying and managing your city's digital DNA. The building blocks for using city data effectively and efficiently will ultimately reside in a city's ability to repurpose its existing data and the documents associated with the built environment, which form the authenticated digital DNA of all cities. The field of DNA data storage has been exploding ever since, but what is yet to be seen is a fully working prototype that will both write and read DNA at a rate as fast as, yet cheaper than, digital storage methods. Both industry and investors have picked up the gauntlet, working toward making DNA a commercially viable solution for long-term data storage.
Why do we need molecular information storage?
The scale and complexity of the world's "big data" problems are increasing rapidly. Use cases that require storage of, and random access to, exabytes (10¹⁸ bytes) of mostly unstructured data are now well established in the private sector and are of increasing relevance to the public sector. However, meeting these requirements poses extraordinary logistical and financial challenges: today's exabyte-scale datacenters occupy large warehouses, consume megawatts of power, and cost billions of dollars to build, operate, and maintain over their lifetimes. This resource-intensive model does not offer a tractable path to scaling beyond the exabyte regime in the future. The Intelligence Advanced Research Projects Activity (IARPA) gives the following description on its website (https://www.iarpa.gov/index.php/research-programs/mist?id=1077): "Molecular information storage (MIST) is one of the highly visible programs in the Intelligence Advanced Research Projects Activity (IARPA) organization, a government agency that invests in high-risk/high-payoff research programs that have the potential to provide the country with an overwhelming intelligence advantage over future adversaries. The goal of the MIST program is to develop deployable storage technologies that can eventually scale into the exabyte regime and beyond with reduced physical footprint, power, and cost requirements relative to conventional storage technologies. MIST seeks to accomplish this by using sequence-controlled polymers (a chain of small molecules) as a data storage medium, and by building the necessary devices and information systems to interface with this medium. Technologies are sought to optimize the writing and reading of information to/from polymer media at scale, and to support random access of information from polymer media archives at scale."
The search for new storage technologies is hardly new. In recent years, the proliferation of data-generating devices (scientific instruments and commercial IoT) and the rise of AI and data analytics capabilities that make use of vast datasets have increased the pressure to find alternative approaches to storage. Magnetic storage has reached a critical mass and is ready to collapse. The MIST program is expected to last 4 years and be composed of two 24-month phases. The desired capabilities of the phases of the program are described by three technical areas:
Technical Area 1 (Storage): Develop a table-top device capable of writing information to molecular media within a target throughput and resource utilization budget. Multiple, diverse approaches are anticipated, which may utilize DNA, synthetic polymers, or other sequence-controlled polymer media.
Technical Area 2 (Retrieval): Develop a table-top device capable of randomly accessing information from molecular media within a target throughput and resource utilization budget. Multiple, diverse approaches are anticipated, which may utilize optical sequencing methods, nanopores, mass spectrometry, or other methods for sequencing polymers in a high-throughput manner.
Technical Area 3 (Operating system): Use the digital immunity ecosystem (DIE), as the operating environment, where DNA storage and retrieval devices will coordinate addressing, data compression, encoding, encryption, error correction, and decoding of files from DNA in a manner that supports efficient random access at scale.
The objective of MIST is technologies that jointly support end-to-end storage and retrieval at the terabyte scale, and which present a clear and commercially viable path to future deployment at the exabyte scale. Collaborative efforts and teaming among potential performers are highly encouraged. "The scale and complexity of the world's 'big data' problems are increasing rapidly," said MIST program manager David Markowitz.
"Use cases that require storage and random access from exabytes of mostly unstructured data are now well-established in the private sector and are of increasing relevance to the public sector." Registration for the MIST program closed on February 14, 2018. Not surprisingly, IARPA is emphasizing the multidisciplinary nature of the project, with expected disciplines including chemistry, synthetic biology, molecular biology, biochemistry, bioinformatics, microfluidics, semiconductor engineering, computer science, and information theory. IARPA is seeking participation from academic institutions and research centers from around the world.
Smart city needs smart data
It is fair to say that no two "smart" cities are alike. Every smart city has its own design needs and problems to solve, which results in its own specific focus: on sustainability, privacy, safety, infrastructure, transportation, citizen participation, or even the transformation of governance as such. However, one thing remains the same: every city, rich or poor, old or young, is in dire need of accurate "reference" data to deal with the numerous challenges it faces, day in and day out. Moreover, thanks to a platform like DataBroker DAO (https://databrokerdao.com), these data are
FIGURE 11.1 The symbolic hierarchical network of all the components of a smart city. We divided the chart into two layers: the top layer represents the major responsibilities of city management, while the bottom layer shows the responsibilities of the citizens who live in the city. The heavy line represents the impact of DIE/DNA upon the city-level responsibilities. No two cities are alike; each one has its own character and is bundled with its own tradition and culture. The biggest challenge is to apply "smart" technology without burdening the city with fatal attractions.
easily accessible to everybody. Fig. 11.1 is a 3D map of all the informational components that a smart city needs. A data broker can be deployed to gather information about how well the city is doing with regard to its citizens and what the city administration can do to enhance the social needs of citizens in healthcare, education, and security. We have included a reference at the end of the chapter for more information about the anatomy of data brokers. Let me explain, in layman's language, some important terms that are relevant to our topic. Software engineers and programmers try, subconsciously, to impress the business user (client) with high-tech
terms that can be simplified down to their atomic level. Here are two new technical terms used in the information exchange between the user and the machine:
Data broker: a smart, AI-centric program that collects and synthesizes information from a variety of databases and files, and sequences the information in the requested order.
DAO: No, it is not Dead on Arrival. It is simply a universal, multiplug adapter that can interface directly with several heterogeneous databases and pass the information to the broker for streaming and formatting.
The smart city is a big inverted tree with unlimited branches and a special topology that mirrors the futuristic vision of its leaders. A smart city does not have absolute standards that can be applied by its designers; a scan of the various smart city definitions finds that technology is the one common element. The cities of the 19th and 20th centuries do not fit into our 21st-century urban blueprints. History tells us that most cities have a 50-year window to become a technology-centric smart city; otherwise, those cities will suffer chaotic political and economic turbulence and remain behind the times, stagnant, and draggy. Take, for example, Dubai. It is morphing into a smart city because of the futuristic, well-thought-out vision of its leadership. Today, Dubai has firm governance standards and regulations that control all aspects of its explosive urbanism. Other modern cities are speeding into the smart zone, but most of them have hidden political and economic faults that will impede their progress. According to the "World Cities Report 2016" published by www.unhabitat.org (United Nations Human Settlements Programme), there are over 50 war-riddled cities in the world that were once crucibles of civilization and societal harmony.
We have entered the Age of Smart Cities, in which high-performance urban environments are being created by a perfect storm of economic conditions, futuristic thinking, next-generation ICT, and massive urban migration that requires new and existing cities to respond with powerful new programs, solutions, and relationships between people, places, and things. This requires not just smart technologies and systems but smart thinking. The basic goal of smart cities is to improve the quality of life and well-being of their citizens, as human capital far outweighs any other measure of a successful urban environment. To plan, design, construct, and operate smart cities, there is an emerging need for management tools. Cities are a mirror of the values of our age. Both large and small smart city solutions could assist in creating an urban environment in which people prosper in a welcoming, inclusive, and open manner. When people, places, and things begin to communicate seamlessly and transparently, interesting things begin to happen. This is the promise of smart cities. Getting smart cities right is our generation's greatest challenge and the best legacy we can leave to our children.
The smart city will switch from hardware to bioware
The one thing we have learned from technology is that change is the only constant in this world, and that certainly applies to informatics and computing technology. Data storage is the most pivotal component affecting information-consuming communities. Information archiving is essential,
FIGURE 11.2 What a contrast between conventional magnetic data storage and modern DNA data storage. All the massive mountains of hardware and processor racks will be replaced by DNA code. As the saying goes, necessity is the mother of invention. All the companies that invested in magnetic hardware will have to replace it with DNA bioware; it is only a matter of time. Picture extracted from MERIT Cyber Security knowledge base.
not necessarily for historical purposes, but to support day-to-day living and the discharge of social obligations. Migration to a DNA storage environment is the only safe escape from the present data strangulation; Fig. 11.2 shows the promising future of information prosperity. Here are some statistics that will make your hair stand up. But before we talk about the digital universe's incoming storage strangulation, here is an episode we all know about:
Noah and his family built the ark. Noah warned people that the Flood was coming, but they did not listen to him. They continued to do bad things. After the ark was finished, Noah took animals into the ark, and he and his family also went inside. Then Jehovah brought a great storm. Rain fell for 40 days and 40 nights. The water flooded all the earth.
– Genesis 7:7-12
The wicked people lost their lives, but Noah and his family were saved. Jehovah brought them safely through the Flood into an earth cleansed of wickedness.
– Genesis 7:22, 23
The digital universe is undergoing a "big bang," and trends show global exponential growth in datacenters; ask Google, Facebook, or Microsoft how much money they are spending on theirs. The expansion contagion is spreading to public and private clouds as well. According to the Cisco Global Cloud Index: Forecast and Methodology, 2016-2021, hyperscale datacenters will grow from 338 at the end of 2016 to 628 by 2021 and will house 53% of all installed datacenter servers by 2021. Google, Facebook, Microsoft, and Amazon are the companies with the most datacenter facilities around the world. Here are some numbers:
• Google has 900,000 servers distributed across 13 datacenters.
• Amazon has 450,000 servers, 40,000 of them dedicated to its Web Services division.
• Microsoft, with over 500,000 servers, has spent over $15 billion on its datacenters.
• Facebook stores 100 petabytes (10¹⁵ bytes) of data, around 100,000,000 GB, in its datacenters.
In enterprises, massive budgets are consumed by the people, so-called planners, hired to grapple with the complexity of data storage, yet the battle is still being lost. The term "storage" does not do justice to the value of information; we would like to use the term "safeguarding" instead. In Fig. 11.3, we take all the information generated by the critical systems of the city of Dubai and store it dynamically in DNA libraries. The city will have the necessary systems to recover the data back to its original text at any time. To classify datacenters by their energy efficiency, we use the power usage effectiveness (PUE) metric:

PUE = Total Facility Energy / IT Equipment Energy

PUE is determined by dividing the amount of power entering a datacenter by the power used to run the computer infrastructure within it. The average datacenter PUE is about 1.8, which means that a datacenter has an overhead of 80% for activities that are not exclusive to the servers (mainly cooling). One of the primary goals of using DNA data storage in a smart city will be minimizing the energy consumption of datacenters. The idea of storing 215 petabytes (215 × 10¹⁵ bytes) in a single gram of DNA is mind-boggling. DNA is also extremely durable. By some estimates, if the DNA is kept cool and dry, without being exposed to light or radiation, it could last thousands of years and never become obsolete.
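The PUE formula reduces to a one-line calculation. A small sketch in Python, with hypothetical kilowatt figures chosen to reproduce the chapter's average of 1.8:

```python
def pue(total_facility_power_kw: float, it_equipment_power_kw: float) -> float:
    """Power usage effectiveness: total facility power over IT equipment power."""
    return total_facility_power_kw / it_equipment_power_kw

def overhead_fraction(pue_value: float) -> float:
    """Share of power spent on non-IT activities (mainly cooling)."""
    return pue_value - 1.0

# A PUE of 1.8 means 80% overhead on top of the power that
# actually reaches the servers.
assert pue(1800.0, 1000.0) == 1.8
assert abs(overhead_fraction(1.8) - 0.8) < 1e-9
```

An ideal datacenter would approach a PUE of 1.0, with every watt going to the IT equipment itself.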
DNA potentialities
DNA also has the potential to lead to significant cost savings, in part because it requires much less space and energy than today's storage media, but also because synthesizing and sequencing technologies will continue to grow more efficient while dropping in price as researchers dig further into DNA's inner workings.
Density: DNA is, above all, extremely dense. We can take 200 petabytes (2 × 10¹⁷ bytes) of data and store it on 1 g of DNA. It is estimated that all the knowledge on the Web might easily be contained in DNA within the space of a shoebox.
Fidelity: Data restoration can be virtually error-free as a result of the accuracy of DNA replication mechanisms.
Sustainability: The power required to maintain DNA-encoded data is a small fraction of that required by modern data facilities.
Longevity: DNA is a stable molecule that can last for hundreds of years without degrading; it lasts over 100 years, orders of magnitude longer than traditional media. Try to listen to any disk from the 90s and see if it is still good.
The big issue with DNA storage is reading and writing the data. The writing is done by Twist; the company can produce custom strings of DNA using a machine it built. The firm's principal clients are research labs that insert customized genetic material into microbes to produce organisms that can carry out useful chemical processes, such as producing desirable nutrients.
FIGURE 11.3 AI technology and DNA computing are banded together to engineer this futuristic cybersecurity ecosystem. At the top, we have all the city's critical systems collaborating constantly; the Smart Grid and the Smart Vaccine keep these engines running. Harnessing DNA for digital storage is revolutionary: the digital universe is running out of magnetic storage, and DNA offers significant advantages that will be used in DIE.
Utilizing DNA for data storage is a brand-new field for the firm. A customized DNA sequence costs about 10 cents per base, with Twist hoping to get that price tag down to 2 cents. Reading the data uses genetic sequencing, the cost of which has dropped substantially over the last 20 years. The Human Genome Project, which ran from 1990 to 2003, cost about $2.7 billion; the same process can now be done for about $1,000. These costs, though dropping, imply that the commercial viability of artificial DNA storage is still a way off; however, the technology itself works. Microsoft says that its preliminary trials with Twist have proven that the method allows complete retrieval of the encoded data from the DNA without any errors. In short, it works. If the costs of this technology can be brought sufficiently down, it means that in the future long-term data archiving may use the same technology as life itself. Fig. 11.4 describes the capacity differential between conventional magnetic media and DNA storage.
FIGURE 11.4 DNA’s information storage density is several orders of magnitude higher than that of any other known storage technology. For example, flash memory can, at best, store 1 bit of data in about 10 nm (1 nm = 10^-9 m); DNA stores 2 bits per 0.34 nm. One kilogram of DNA can store about 2 x 10^24 bits; storing the same amount in flash memory would require more than 10^9 kg of silicon. A few tens of kilograms of DNA could meet the world’s storage needs for centuries to come. Picture extracted from MERIT Cyber Security Library.
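The caption's per-kilogram figure can be sanity-checked with a short calculation. This assumes double-stranded DNA at roughly 660 g/mol per base pair (a standard textbook approximation, not a value taken from this book) and 2 bits stored per base pair.

```python
# Order-of-magnitude check of DNA storage density per kilogram.
# Assumes ~660 g/mol per base pair and 2 bits per base pair.
AVOGADRO = 6.022e23          # base pairs per mole
GRAMS_PER_MOL_BP = 660.0     # approximate molar mass of one base pair

moles_per_kg = 1000.0 / GRAMS_PER_MOL_BP
bits_per_kg = moles_per_kg * AVOGADRO * 2
print(f"{bits_per_kg:.1e} bits per kilogram of DNA")  # ~1.8e24
```

The result, about 1.8 x 10^24 bits/kg, is consistent with the caption's figure of roughly 2 x 10^24 bits per kilogram.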
Concluding revelation

In sum, what today’s data storage problem needs is a cure, not more fractured and fragmented products or an endless overlay of palliatives that mask the baggage of the storage industry’s failed architectural theories, which in turn rob human beings of the time and attention needed to manage the current mess of fragility and incompatibility.
Obsolescence is another shortcoming of current storage methods. Just as it has become difficult to play your cassette tape collection (much to my chagrin), your old floppies and ZIP disks are not readable either. Scientists conjecture that because DNA is the basis for life on Earth, we will always have methods for DNA sequencing. This gives DNA a huge advantage for long-term archival storage. Luckily, DNA also has great fidelity over the long term. Scientists are increasingly able to recover readable sequences from ancient samples of DNA, with the best results coming from samples stored at low temperatures. Thus, you could imagine a long-term storage system, like a secure server in a remote tundra, where the DNA backup disk to restart civilization would be stable and safe. This is not completely crazy. The Svalbard Global Seed Vault is a huge storage site in the frozen tundra of Norway where scientists and governments are making contributions of plant seeds. The idea is to keep a stock of the original seed in case of the collapse of civilization. I am sure we could rent a shoebox-sized space there for storing all the relevant files from humankind (that means there probably would not be room for cat videos). Resource limitations will certainly continue to drive digital DNA storage, born of a thought experiment over beer, toward becoming not just a reality but a necessity.
Appendices

Appendix 11.A Glossary: Smart Cities (MERIT Knowledge Library)

Acute shocks: Sudden, sharp events that threaten a city. See chronic stresses. Agritecture: A unique way of combining urban agriculture, innovative technical solutions, and architecture to meet the demand for efficient food production within cities. Aquaponics: A sustainable production system for integrating aquaculture with hydroponic vegetable crops. Autonomous system: A system or network that gathers information, determines what is needed, and issues a response or dispatches another machine to answer the call. Autonomous vehicle: A driverless car, unmanned aerial vehicle (UAV) “drone,” or unmanned underwater vehicle. A vehicle that can guide itself without human conduction. Bicycle detection and actuation: Sensors at regular traffic signals to alert the controller of bicycle crossing demand on a particular approach. Bicycle, electric assist: Where the pedal-assist electric drive system is limited to a decent but not excessive top speed, and where its motor is relatively low powered. Bike barometer: A device that uses sensors that are calibrated to be triggered by bikes, not cars or pedestrians, to count the number of bikes that ride by daily, monthly, and/or yearly. Bike boxes: A designated area at the beginning of a traffic lane at a signalized intersection that provides bicyclists a visible and safe way to get ahead of traffic during a red signal. Bike lane, buffered: Conventional bicycle lane paired with a designated buffer space separating the bicycle lane from the adjacent motor vehicle travel lane and/or parking lane. Bike share: Innovative transportation programs that are ideal for short-distance trips, providing users the ability to pick up a bike at a self-serve station and return it at a different self-serve station within the system’s service area.
Bike signal head: A traffic control device, exclusively for bicycles, used at an existing conventional traffic signal to improve safety, guidance, and operational problems at intersections where bicycles may have different needs than motorized traffic.
Biodiversity: The variety of life in the world or in a particular habitat or ecosystem. Biofuel: A fuel derived from living matter. Biomimicry: The design and production of materials, structures, and systems that are modeled on biological entities and processes. Biophilia hypothesis: Edward O. Wilson’s theory that humans have an innate, genetic predisposition to connect or affiliate with nature. Biophilic city: A city that contains abundant nature and whose inhabitants care about, seek to protect, restore, and grow this nature, and who strive to foster deep connections and daily contact with the natural world. Biophilic design: An innovative method of design that incorporates elements of nature into the modern design to help restore and preserve our innate need to affiliate with nature. Bioswales: Human-built filtration systems that use soil, gravel, and plants to catch and process stormwater before being returned to groundwater. Bollard: A short post used to divert traffic from an area or road. Car, autonomous: See autonomous vehicle. Car, driverless: See autonomous vehicle. Car, electric: Uses energy stored in its rechargeable batteries, which are recharged by common household electricity. Car, hybrid: A car fueled by gasoline that uses a battery to improve efficiency. Carbon credits: A permit that allows a country or organization to produce a certain amount of carbon emissions and can be traded if the full allowance is not used. Carbon footprint: The amount of carbon dioxide and other carbon compounds emitted due to the consumption of fossil fuels by a particular person, group, etc. Carbon neutral city: No net release of carbon dioxide to the atmosphere, especially through offsetting emissions such as by planting trees. Carbon sequestration: The process of removing carbon from the atmosphere. Carbon tax: A tax on fossil fuels, especially those used by motor vehicles, intended to reduce the emission of carbon dioxide.
Carrying capacity: The uppermost threshold for the number of humans that the Earth can sustain. Change, bottom-up: Change through social practices and individual choices. Change, top-down: Change through policymaking and regulation. Chicane: A chicane is a series of alternating midblock curb extensions or islands that narrow the roadway and require vehicles to follow a curving S-shaped path, which discourages speeding (traffic calming). Chicanes can also create new areas for landscaping and public space in the roadway. See traffic calming. Chronic stresses: Stresses that weaken the fabric of a city on a daily or cyclical basis. See acute shocks. Climate change: See global climate change. Closed-loop system: A system that does not exchange matter with substances outside of its own parts. Community farm alliance: A community organization strategy for connecting local farmers with the community by creating a direct local market. Typically, a community member can pay upfront for a regular supply of fresh produce from a local farmer.
Community garden: A piece of land gardened by a cooperative group of people living in an area that encourages an urban community’s food security. Community-supported agriculture: A system in which a farm operation is supported by shareholders within the community who share both the benefits and risks of food production. Compact city: An urban planning and urban design concept, which promotes a relatively high residential density with mixed land uses. Also called a city of short distances. Contra-flow bicycle lanes: Bicycle lanes designed to allow bicyclists to ride in the opposite direction as motorized traffic on a one-way street. Crowdsourcing: The practice of obtaining information or input into a task or project by enlisting the services of a large number of people, either paid or unpaid, typically via the Internet. Cultural lag: The notion that culture takes time to catch up with technological innovations, and that social problems and conflicts are caused by this lag. Cultural lag applies not only to this idea but also to theory and explanation. Cycle track: An exclusive bicycle facility that combines the experience of an off-street bicycle path with on-street infrastructure of a conventional bicycle lane. A cycle track is physically separated from motor traffic and distinct from the sidewalk. Demographic dividend: When countries’ age structures change favorably, meaning that they have more people of working age than dependents, they can see a boost to development, provided that they empower, educate, and employ their young people. Digital divide: The gulf between those who have ready access to computers and the Internet, and those who do not. Disaster, cascading: Natural disaster that leads to other disasters in a domino effect. Disaster, complex/compound: Multiple, interrelated disasters such as earthquakes, fires, and floods.
Disaster, na-tech (natural-technological disaster): Natural disaster that creates a technological disaster such as power outages or nuclear incidents. Disaster resilience: A combination of a society’s preparedness for a hazard, their ability to mitigate, plan, and respond immediately and effectively to it, and their ability to recover and regenerate from the event. Disaster, synergistic: A disaster that is increased in severity by subsequent disasters. For example, an ice storm that creates impacts on transportation and power supply. Drone: See autonomous vehicle. Ecocity: A human settlement modeled on the self-sustaining resilient structure and function of natural ecosystems. An ecocity seeks to provide healthy abundance to its inhabitants without consuming more renewable resources than it replaces. Eco-innovative: The development of products and processes that contribute to sustainable development, applying the commercial application of knowledge to elicit direct or indirect ecological improvements. Ecological model: A network of relationships and interactions. Ecosystems services: Services provided by nature that humans and other organisms rely on in our everyday lives. Ecotourism: Responsible travel to natural areas that conserves the environment, sustains the wellbeing of the local people, and involves interpretation and education.
Ecovillage: A community whose inhabitants seek to live according to ecological principles, causing as little impact on the environment as possible. eGovernment: The use of ICTs to improve the activities of public sector organizations. Some definitions restrict e-government to Internet-enabled applications only, or only to interactions between government and outside groups (i.e., eGovernment for Development). Electric vehicle: A vehicle that uses one or more electric motors for propulsion. Energy, alternative: Energy generated in ways that do not deplete natural resources or harm the environment, especially by avoiding the use of fossil fuels and nuclear power. Energy neutral: The total amount of energy used on an annual basis is roughly equal to the amount of renewable energy created on the site. Energy, nonrenewable: Energy from a source that cannot be replaced, such as coal, oil, and natural gas. Energy, renewable/sustainable: Energy from a source that is not depleted when used, such as wind or solar power. Energy, solar: The energy the Earth receives from the Sun, primarily as visible light and other forms of electromagnetic radiation. Environmental justice movement: Environmental justice is the fair treatment and meaningful involvement of all people regardless of race, color, national origin, or income with respect to the development, implementation, and enforcement of environmental laws, regulations, and policies. Environmental racism: The disproportionate impact of environmental hazards on people of color. Environmentalism: A broad ideology concerned with protecting the environment. Equity: The absence of avoidable or remediable differences among groups of people, whether those groups are defined socially, economically, demographically, or geographically. Equity lens: A transformative quality improvement tool used to improve planning, decisionmaking, and resource allocation leading to more racially equitable policies and programs. 
At its core, it is a set of principles, reflective questions, and processes that focuses on the individual, institutional, and systemic levels (i.e., Multnomah County). Experience economy: The next economy following the agrarian economy, the industrial economy, and the most recent service economy. See experiential realms. Experiential design: Experience design is the practice of designing products, processes, services, events, omnichannel journeys, and environments with a focus placed on the quality of the user experience and culturally relevant solutions. Experiential realms: The experience economy offers four realms of experiential value to add to a business. Pine and Gilmore (1999) termed these realms, the 4Es. The 4Es consist of adding educational, esthetic, escapist, and entertainment experiences to the business. First and last mile: Gaps in public transit that require individuals to use other forms of transport. Food mile: A mile over which a food item is transported from producer to consumer, as a unit of measurement of the fuel used to do this. Garden city: Intended to be a planned, self-contained community surrounded by “greenbelts,” containing proportionate areas of residences, industry, and agriculture. Global climate change: A change in global or regional climate patterns, in particular a change apparent from the mid to late 20th century onward and attributed largely to the increased levels of atmospheric carbon dioxide produced by the use of fossil fuels.
Global climate change adaptation: Actions taken to help communities and ecosystems cope with changing climate conditions (i.e., UNFCCC). Global climate change mitigation: Any action taken to permanently eliminate or reduce the longterm risk and hazards of climate change to human life, property (i.e., IPCC). Global goals: See sustainable development goals. Green alleyways: Green alleyways convert underused alleyways into community assets and resources for environmental, economic, and social benefits. Green alleyways activate the public space for more than vehicular use and garbage disposal and involve a combination of environmental, health, economic, and social purposes. Green building: An environmentally sustainable building, designed, constructed, and operated to minimize the total environmental impacts. See living building. Green building materials: Composed of renewable and/or recycled materials, rather than nonrenewable resources. Green city: Urbanization in balance with nature. See sustainable city. Green index: A process used to determine the amount of environmental impact a city has. Green infrastructure: Man-made structure and technology that are designed with the intent of being green, that is, green energy in preference to dirty (nonrenewable, polluting) energy. Green roof: When plants of different varieties are planted on rooftops to facilitate increased plant matter. Their function can range from aesthetic to practical insulation or food source. Green street: A street right of way that, through a variety of design and operational treatments, gives priority to pedestrian circulation and open space over other transportation uses. The treatments may include sidewalk widening, landscaping, traffic calming, and other pedestrian-oriented features. Green wall: A living or green wall is a self-sufficient vertical garden that is attached to the exterior or interior of a building. 
The living wall’s plants root in a structural support, which is fastened to the wall itself. Green wave: A purposefully designed timing of a series of traffic lights to produce a green light for bicycles traveling at the correct speed (typically 12 mph) as they arrive at the lights. Greenscape: An area of vegetation in an urban area set aside for aesthetic or recreational purposes. Greenwash: Disinformation disseminated by an organization to present an environmentally responsible public image. Greenway: A strip of undeveloped land near an urban area, set aside for recreational use or environmental protection. Health disparities: Health disparities are preventable differences in the burden of disease, injury, violence, or opportunities to achieve optimal health that are experienced by people who have historically been made vulnerable by policies set by local, state, and federal institutions. Populations can be defined by factors such as race or ethnicity, gender, education or income, disability, geographic location, gender identity, or sexual orientation. Health disparities are inequitable and are directly related to the historical and current unequal distribution of social, political, economic, and environmental resources (i.e., Centers for Disease Control and Prevention). Health, environmental: The study of how environmental factors can harm human health and how to identify, prevent, and control such effects (i.e., University of Washington). Health equity: Health equity is achieved when every person has the opportunity to attain his or her full health potential and no one is disadvantaged from achieving this potential because of social position or other socially determined circumstances (i.e., Centers for Disease Control and Prevention).
Health, public: Public health is the science of protecting and improving the health of families and communities through the promotion of healthy lifestyles, research for disease and injury prevention, and detection and control of infectious diseases (i.e., Centers for Disease Control and Prevention Foundation). Healthy community: One that is continuously creating and improving those physical and social environments and expanding those community resources that enable people to mutually support each other in performing all the functions of life and in developing to the maximum potential (i.e., World Health Organization). High albedo pavement: High albedo concrete is a special type of pavement that reflects more light than dark-colored materials due to its lighter color. This causes the concrete to have a lower surface temperature, resulting in less energy needed to cool surrounding buildings and less energy consumed by nighttime lighting. Inclusivity: An intention or policy of including people who might otherwise be excluded or marginalized, such as those who are handicapped or learning-disabled, or racial and sexual minorities. Intelligent city: See smart city. Internet of Things: The interconnection via the Internet of computing devices embedded in everyday objects, enabling them to send and receive data. Last mileage problem: A problem faced by transit agencies, how to get commuters to public transit without the use of individually owned automobiles. Leadership in energy and environmental design: An ecology-oriented building certification program run under the auspices of the US Green Building Council. Living alleyways: Living alleyways are narrow, low-volume traffic streets that focus on livability, instead of parking and traffic. Living alleyways are primarily for pedestrians and bicyclists as well as spaces for social uses. Vehicles are typically still allowed access but with reduced speeds. 
Living building: A concept that uses nature as the ultimate measuring stick for a building’s performance. Living wall: See green wall. Low carbon city: A low carbon city reduces its carbon footprint by focusing on renewable energy and mitigation measures. Low impact development: Development that through its low negative environmental impact either enhances or does not significantly diminish environmental quality. Midblock crossing: A location between intersections where marked crosswalks have been provided. The crosswalk may have signals or no signals. They offer a convenient location for pedestrians to cross in areas without frequent intersection crossings. Modernization: The process of adapting something to modern needs or habitats. Modular bike share: A bike share that is usually solar-powered and quick and cheap to install; its stations can be altered and moved, and it typically does not require trenching, excavation, or other preparatory work. Municipal solid waste management: The transfer, collection, and treatment of municipal solid waste. Net zero city: See zero energy city. Normalcy bias: The difficulty of comprehending that a disaster is occurring.
Open bottom catch basin: A component in a landscape drainage system. It is a box that is put into the ground near areas of standing water to help facilitate proper water drainage and avoid property damage. Overpopulation: The condition of having a population so dense as to cause environmental deterioration, an impaired quality of life, or a population crash. Paris Agreement: The Paris Agreement is an international treaty that seeks to reduce the emission of greenhouse gases. Permeable pavement: Permeable pavement can be asphalt, concrete, or pavers, and lets stormwater filter through and drain into the ground instead of collecting on hard surfaces or draining into the sewer system. It also traps suspended solids and filters pollutants from the water. Personal Rapid Transit: Also known as podcars, these vehicles are operated using a computer and can transport small groups of people using electric motors on lightweight tracks. An example of this transportation system is in Masdar City, United Arab Emirates, which is made up of 10 autonomous vehicles and is the only mode of transportation throughout the city. Plastiblocks: Construction materials, generally blocks or bricks, made from recycled plastic-like materials. Pollution, plastic: Accumulation of plastic products in the environment that adversely affects wildlife, wildlife habitat, or humans. Progressive Era (1890-1920): A period of social and political reform that developed in response to the pitfalls of industrialization and urbanization. Psychogeography: The study of the effects of a city’s environment, consciously organized or not, on the emotions and behavior of individuals. Quality of life: A broad multidimensional concept that usually includes subjective evaluations of both positive and negative aspects of life. Although health is one of the important domains of overall quality of life, there are other domains as well, for instance, jobs, housing, schools, and the neighborhood.
Aspects of culture, values, and spirituality are also key domains of overall quality of life that add to the complexity of its measurement (i.e., Centers for Disease Control and Prevention). Rails to trails: The conversion of a disused railway into a multiuse path. Raised cycle track: Bicycle lanes that are vertically separated from motor vehicle traffic. Resilient city: One that has developed capacities to help absorb future shocks and stresses to its social, economic, and technical systems and infrastructures so as to still be able to maintain essentially the same functions, structures, systems, and identity. Reversible lane: See street, reversible. Romantic environmental paradigm: Draws attention to the destruction and domination of nature and calls for a more harmonious relationship between humans and the natural world. Self-driving car: See driverless car. Sense of place: Either the intrinsic character of a place, or the meaning people give to it, but, more often, a mixture of both. Sharrow: An arrow with a bicycle painted on vehicular lanes to indicate that cyclists have the right to use the road alongside vehicles. Sidewalk garden: A small garden that is planted on a street sidewalk to add biodiversity to an urban landscape. It helps slow runoff and is effective in stormwater management. A sidewalk garden also includes public seating to allow the public to spend time in the alley and have a natural setting to relax in.
Singularity: See technological singularity. Smart city: An urban development vision to integrate ICT and IoT technology securely to manage a city’s assets. See smart sustainable city. Smart connections: A vital component of a smart city, incorporating transportation, online access, technology, and community. Smart container: A smart container is one that can connect wirelessly to a network and relay various amounts of information to a database. Smart parking management: Parking lot sensors for drivers and property managers. Smart street lighting: In addition to LED technology, street lighting can be motion-activated and gather environmental data. Smart sustainable city: A smart sustainable city is an innovative city that uses ICTs and other means to improve quality of life, efficiency of urban operation and services, and competitiveness, while ensuring that it meets the needs of present and future generations with respect to economic, social, and environmental aspects (i.e., ITU-T focus group). Smart urbanism: Smart urbanism merges information and communication technologies with energy, resource, and infrastructure technologies into networks that create sustainable, resilient, regenerative, urban-rural ecosystems with vibrant communities, thriving economies, and biodiverse environments. Smart waste management system: A waste management system that incorporates the use of information sensors, wireless internet, GPS tracking, and efficiency programs to assess and calculate optimal disposal strategies. Social cohesion: The willingness of members of a society to cooperate with each other to survive and prosper. Solar city: A city actively using solar energy to reduce or replace fossil fuels. Solar power: Power obtained by harnessing the energy of the Sun’s rays. Street, multifunctional: A street made to perform more than one task or function at once, or accessible for more than one function.
Multifunctional streets provide green infrastructure, public space, green space, and other functions. Street, multimodal: Street designed for more than just car traffic and which puts a priority on public transportation, walking, and biking safely and efficiently. Street, reversible: These streets change directions at different times and for different purposes to maximize efficiency. Streetscape: The space that encompasses the road, sidewalk, and strip. Refers to the design and functionality of the area. Super app: A smartphone application that contains all kinds of services such as texting, video and voice chatting, paying, social status sharing, and other programs. Superblocks: Designated areas of land in a city that keep traffic from going through the streets, making room for alternative usage of city streets. Supertree: Singaporean engineered mechanical tree that provides solar power to Singapore, cleans the air through air ducts, and collects rainwater for conservatories. Additionally, these trees provide green space for animals, plants, and insects to live. Sustainability: Avoidance of the depletion of natural resources to maintain an ecological balance. Sustainable city: A city designed with consideration of environmental impact, inhabited by people dedicated toward minimization of required inputs of energy, water, and food, and waste outputs of heat, air pollution (CO2, methane), and water pollution. Also known as an ecocity.
Sustainable design: Design practices that aim to reduce waste, pollution, and unnecessary consumption of energy and resources. Sustainable development: The organizing principle for meeting human development goals while at the same time sustaining the ability of natural systems to provide the natural resources and ecosystem services upon which the economy and society depend. Sustainable development goals: The Sustainable Development Goals, officially known as Transforming our world: the 2030 Agenda for Sustainable Development, is a set of 17 “Global Goals” with 169 targets between them. Systematic racism: A practice that prioritizes the needs of white individuals over communities of color. Argues that racism is embedded in social, political, and institutional mechanisms within society. Technological singularity: The hypothesis that the invention of artificial superintelligence will abruptly trigger runaway technological growth, resulting in unfathomable changes to human civilization. Thinking, systems: An approach that focuses on understanding the entirety of a model and not one individual component. Systems thinking looks to understand the interactions and relationships between components of the system. Traffic calming: The deliberate slowing of traffic in residential areas by building speed bumps or other obstructions. Tragedy of the commons: Economic theory of a situation within a shared-resource system where individual users acting independently according to their own self-interest behave contrary to the common good of all users by depleting or spoiling that resource through their collective action. Transition town: The terms transition town, transition initiative, and transition model refer to grassroots community projects that aim to increase self-sufficiency to reduce the potential effects of peak oil, climate destruction, and economic instability.
Transportation, active: Any form of human-powered transportation such as walking, biking, skating, and skiing. Unmanned aerial vehicle: See autonomous vehicle. Urban agriculture/farming: The practice of planting, processing, and distributing food in a town or city. Urban ecology: The scientific study of the relation of living organisms with each other and their surroundings in the context of an urban environment. Urban regeneration: Comprehensive and integrated vision and action that leads to the resolution of urban problems and that seeks to bring about a lasting improvement in the economic, physical, social, and environmental conditions of an area that has been subject to change. Urban resilience: The ability to adapt to changing conditions and withstand and rapidly recover from disruption due to emergencies. Urban standardization: Regulation of urban development. Vertical garden/farm: An alternative form of farming stretching from the ground up, with the ability to be built in areas with limited soil space to grow crops. The indoor form of vertical gardens uses less water, less labor, and requires no sunlight due to LED lights. Waste diverted: Matter that would be converted into waste, but instead is converted into something more useful and beneficial. Waste to energy: Incineration of municipal solid waste that then uses the energy (in the form of heat) to produce electricity and/or steam for heating.
Whitewashing: Refers to the lack of diversity within environmental organizations, causing the needs of communities of color to not be represented. Wildlife corridors: Routes designed to facilitate the migration and free movement of wildlife in and around urban areas, that is, green belts, land bridges. A method of compensation for habitat fragmentation. Wildlife crossing: The proper term for an animal-used land bridge or underpass (also called “critter crossing”). Zero carbon city: A zero-carbon city runs entirely on renewable energy; it has no carbon footprint and will in this respect not cause harm to the planet. Zero energy building: See living building. Zero energy city: A city with zero net energy consumption, meaning the total amount of energy used by the city on an annual basis is roughly equal to the amount of renewable energy created in the city. Zero waste city: A city that diverts all its waste from landfills into either reuse or recycling.
CHAPTER 12
DNA Data and Social Crime
Poverty is the parent of revolution and crime.
—Aristotle
In the 19th century, Italian prison psychiatrist Cesare Lombroso drew on the ideas of Charles Darwin and suggested that criminals were atavistic: essentially evolutionary throwbacks. He suggested that their brains were maldeveloped or not fully developed. In his review of prisoners, he found that they shared a number of common physical attributes, such as sloping foreheads and receding chins. In so doing, Lombroso suggested that involvement in crime was a product of biology and biological characteristics: criminals were born that way. Lombroso's theory is essentially a theory of biological positivism.
—Lombroso and Biological Positivism
We cannot solve our problems with the same thinking we used when we created them. No problem can be solved from the same level of consciousness that created it. If I had an hour to solve a problem, I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.
—Albert Einstein
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00012-4
Copyright © 2020 Elsevier Inc. All rights reserved.

Sources of social crime

On its surface, the smart city presents itself as a utopian dream: a technologically empowered panacea for many of the problems of modern city life. Yet, if we gaze beneath the glossy veneer and unpick its etiological roots, a different picture begins to emerge. In cities, social crime grows out of the worsening social conditions that plague the city and hinder its prosperity: poverty, lack of opportunity, poor education, wrecked family structures, and high levels of alcoholism and drug use. A smart city might therefore seek to lower the crime rate by addressing these conditions and providing good jobs with livable wages, good educational opportunities, stable family structures, and solid infrastructure. At the same time, the smart city is a commercial construct designed to sell a particular vision of capital accumulation and the necessity of digital technologies to achieve it. Big tech companies employ the smart city as a narrative device to sell their products and services, which explains why it has so many incarnations. The main driver of social crime is the premature technological push of the big hi-tech lifestyle onto a city that is not ready for digital reliance. The incremental changes in our social lives and interactions with others, including shifts in our institutions, technologies, and cultural practices, have left social
crime on the back burner. Before unpacking these social sources of the crime increase, we need to look a little more closely at its timing and its variation across offenses, from auto theft to murder. We will discuss DNA computing technology as a way to neutralize social crime. DNA computing will systematically implement DNA storage and security. DNA will be the powerhouse of information about social crime, which can be fed to deep neural networks and supervised machine learning to identify the variables that influence social crime and then work to eradicate it. A serious definition of social crime: laws are often written to proscribe the behavior of the poor. "The law, in its majestic equality, forbids the rich as well as the poor to sleep under bridges, to beg in the streets, and to steal bread."
—Anatole France
Humans are generally social beings; they love being in the company of friends and relatives, which provides comfort and well-being and is informative and often entertaining as well. By nature, some are extroverts and others introverts by choice. Both have positive and negative influences on the personality. Social and environmental influences do play a role in shaping our personality, in addition to our own intelligence. Being deprived of such company could gradually lead to extreme ill effects, including crime, as the idle human mind becomes a devil's workshop.
—Shyam Asoor Krishnamachar
The curse of all times is called deprivation. Poverty takes first place on the podium of miseries. To be indigent is to be a no-parole prisoner, for life, in a prison without walls. Poverty is the fast lane to sickness and, ultimately, mortality. Sinking to the bottom of the humanity barrel is not a choice but a gloomy destiny. The World Health Organization certified the global eradication of smallpox in 1979, a declaration formally endorsed in 1980; smallpox remains the only human disease to be eradicated worldwide.
Poverty as a pervasive social crime

There is a wide spectrum of social conditions that lead to "street crime." Poverty leads to lack of opportunity, poor education, wrecked family structures, high levels of alcoholism and drug use, and so forth. So, one might seek to lower the crime rate by addressing poverty itself. By providing good jobs with livable wages, good educational opportunities, stable family structures, and a solid infrastructure, crime might well be substantially reduced; we see a very low incidence of crime in affluent areas around the country. Unfortunately, addressing these problems is politically fraught and extremely expensive, and though many programs have been launched over the years, few, if any, have been successful. Global poverty is estimated to have declined in 2012 to 902 million people, or 12.8% of the global population, according to World Bank data as of August 2016. Fig. 12.1 illustrates how poverty is declining despite the substantial growth of population. The poverty headcount was forecast to fall in 2015 to 702.1 million, a poverty rate of 9.6%, the first time the share of people living in extreme poverty would be in the single digits. The number of people living in extreme
FIGURE 12.1 The graph shows how poverty will dwindle despite the significant growth of population. Combating hunger and malnutrition, increasing educational and employment opportunities, and migrating to smart cities are the main drivers of improved quality of life.
poverty has fallen from 1.9 billion to 1.2 billion over the past 20–25 years. Despite solid development gains, progress has been uneven, and significant work remains. With an estimated 900 million people in 2012 living on less than $1.90 a day—the updated international poverty line—and a projected 700 million in 2015, extreme poverty remains unacceptably high. If we remain on our current trajectory, many economists predict we could reach global poverty "zero" by 2030–35. Although most regions continue to reduce poverty, meeting the global poverty target by 2030 remains aspirational in all but the most optimistic of scenarios. The incidence of extreme poverty has gone down from almost 100% in the 19th century to 9.6% in 2015. While this is a great achievement, there is absolutely no reason to be complacent: a headcount ratio of 9.6% means that roughly 700 million people are still living in extreme poverty. Where do they live? John Wilmoth, Director of the Population Division in the UN's Department of Economic and Social Affairs, came up with an interesting revelation: "The concentration of population growth in the poorest countries presents its own set of challenges, making it more difficult to eradicate poverty and inequality, to combat hunger and malnutrition, and to expand educational enrollment and health systems, all of which are crucial to the success of the new sustainable development agenda. India is expected to become the largest country in population size, surpassing China around 2022, while Nigeria could surpass the United States by 2050." Anomie is a concept developed by one of the founding fathers of sociology, Emile Durkheim, to explain the breakdown of social norms that often accompanies rapid social change. American sociologist Robert Merton (1957) drew on this idea to explain criminality and deviance in the United States.
His theory argues that crime occurs when there is a gap between the cultural goals of a society (e.g., material wealth, status) and the structural means to achieve these (e.g., education, employment). This strain between means and goals results in frustration and resentment and encourages some people
to use illegitimate or illegal means to secure success. In short, strain theory posits that the cultural values and social structures of society put pressure on individual citizens to commit crime. In trying to understand the genetic infrastructure of complex human behavior such as violence and aggression, it is important to keep several concepts in mind:

Genetic Complexity: In general, there is no simple relationship between our genes and our traits. Our physical, mental, and behavioral states are the result of complex interactions among multiple genes in combination with our environment and our lifestyles. For example, height is influenced by the action of at least 180 regions of the genome, in addition to environmental factors including diet as well as maternal and childhood health.

Biological Determinism: This is a framework for understanding humans through a biological lens; it aims to explain complex human traits as being largely, if not entirely, dictated by biology, particularly our genes. Biological determinism downgrades, if not dismisses, the role that culture and environment might have in shaping human behaviors. This idea has drawn much criticism, and many scientists are now also focusing on interactions between genes and environment and how that relationship may affect traits and behaviors.

Population Genetics: Genetics research seeks to make connections between people's genetic makeup and their traits. Often, the relationship is not a simple one, but rather a statistical correlation based on what percentage of people in the population with a shared genetic makeup exhibit a particular trait. Your DNA sequence can inform you about your predisposition for certain traits, such as your likelihood of reaching a certain height or your risk of developing a disease. It is important to note, however, that predispositions are not guarantees.
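As a toy illustration of how such statistical correlations work, the sketch below computes penetrance (the fraction of a group showing a trait) and the relative risk for carriers of a hypothetical genetic marker versus non-carriers. All counts are invented for illustration, and note that even a threefold risk is far from a guarantee:

```python
def penetrance(affected: int, total: int) -> float:
    """Fraction of individuals in a group who exhibit the trait."""
    return affected / total

# Hypothetical cohort: the marker raises risk but guarantees nothing.
carriers_affected, carriers_total = 30, 200          # 15% of carriers show the trait
noncarriers_affected, noncarriers_total = 50, 1000   # 5% of non-carriers do

p_carrier = penetrance(carriers_affected, carriers_total)
p_noncarrier = penetrance(noncarriers_affected, noncarriers_total)
relative_risk = p_carrier / p_noncarrier

print(p_carrier, p_noncarrier, round(relative_risk, 2))  # 0.15 0.05 3.0
```

Even in this exaggerated example, 85% of carriers never exhibit the trait, which is exactly why "predispositions are not guarantees."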
As the cost of DNA storage continues to drop and technologies become available to a broader population, there are likely many useful or interesting things a person might learn from a personal genome sequence. However, the path to truly understanding the utility and limits of this information will be a lengthy and sometimes confusing one. This is particularly true when seeking genetic explanations not just for complex human diseases and traits but also for behaviors that must be understood in the context of environment, culture, and society. This ambiguity should diminish over time: as we collect DNA sequences from perpetrators, it becomes easier to extract meaningful trends and causal analyses.
Key information stored in DNA data storage

Table 12.1 lists in detail the key variables that play a pivotal role in the security of a smart city. The city grid will be full of dynamic business activities monitored by the grid's Smart Vaccine layer. Key operational information from all categories will be stored in DNA banks and be available for smart city forecast modeling and predictive analyses. We structured the city's performance data into distinct domains, as shown in Fig. 12.2, and moved it to a network of DNA storage devices for further planning analyses.
Table 12.1 Smart city key data split by subject before being funneled into the subject-specific DNA library. Each data category and its subcategories:
Governance and government: city leadership; city council; community meetings; citizen engagement.
Law enforcement: cybercrime cases; terrorist profiling; cyberterrorism; video crime monitoring; demographic crime reports.
Energy management: smart meter management; power generation stations; usage growth; outages.
Ecology: energy recycling; waste management; green energy and ecology; electric cars; street lighting.
Urban mobility: smart parking; smart ticketing; smart traffic management; accidents; metro busses; metro trains.
Internet of Things: man–machine interface; machine-to-machine interconnection; growth forecast.
Housing: demographic records; housing logistics; permits; construction contracts; demography.
Telecommunication: mobile statistics; networking; network capacity; usage and performance.
Healthcare and medicine: hospital records; emergency records; medical caregivers; patient records; clinics and ACU; diseases and breakouts; pharma activities; FDA alerts.
Finance and banking: banking data; blockchain transactions; financial system data.
Education and academia: universities; academic institutions; student records; graduation specialties; funding.
Central coordination center: Smart Vaccine CCC; city CCC; early warning; early warning activities; dispatch activities; user management.
City critical systems: crime data and forecast; criticality; grid activities.
Digital immunity ecosystem: virus data management; vaccine data management; reverse engineering data; Smart Vaccine activities; DIE data libraries; attack activities.
DNA data storage: data synthesis; data sequencing and encryption; binary storage; backups; city operational data library.
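A minimal sketch of the "funneling" step the table describes: each operational record is tagged with its data category so it lands in the correct subject-specific library before being handed off for DNA archiving. The record layout and function name here are illustrative assumptions, not the book's implementation:

```python
from collections import defaultdict

def route_records(records):
    """Group raw city records into per-category batches for DNA archiving."""
    libraries = defaultdict(list)
    for record in records:
        libraries[record["category"]].append(record["payload"])
    return libraries

# Illustrative records tagged with data categories from Table 12.1.
records = [
    {"category": "Law enforcement", "payload": "cybercrime case #4411"},
    {"category": "Energy management", "payload": "smart meter reading, feeder 7"},
    {"category": "Law enforcement", "payload": "video crime monitoring clip 88"},
]
libs = route_records(records)
print(len(libs["Law enforcement"]))  # 2
```

Keeping each subject in its own library is what later allows a single category (say, crime data) to be sequenced, encrypted, and decoded independently of the rest of the archive.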
FIGURE 12.2 Shielding a smart city from the malware universe is a great adventure—a smart city is a journey. The systematic sequencing and storage in the DNA library will give the smart city great security leverage. The DNA storage libraries will remain safe and encrypted but can be decoded upon request.
DNA can interpret the behavior of a mass killer

Can we reverse engineer DNA and predict a crime? Wayne Carver, the state's chief medical examiner, asked geneticists at the University of Connecticut to join the investigation of the December 14, 2012, killings at Sandy Hook Elementary School in Newtown. A retired FBI profiler said in response, "I think it's great to consider if there's something here that would help people understand this behavior." In principle, genetics could pinpoint a biological defect that could explain the rampage of Adam Lanza, who shot students, school staff members, his mother, and himself. Here are some of the things investigators found: Lanza was "a nerd," quiet, shy, and smart, and he favored the dark, eerie aesthetic of Goth culture. So many kids fit this profile that it is almost inconceivable they do not share some genetic markers. An article published on December 28, 2012, by Marc Lallanilla, titled "Genetics May Provide Clues to Newtown Shooting," claimed that criminal behavior can be predicted by genetics. Indeed, family identity was briefly implicated in the Sandy Hook massacre—for a few hours, Lanza's brother was thought to be the shooter—but that turned out to be another red herring. A detailed analysis of Lanza's pedigree, however, might reveal antisocial ancestors, which could aid in post hoc prediction.
It was clinically demonstrated that a form of a serotonin receptor gene (HTR2B) is associated with violent impulsive behavior. A study published in 2010 triggered wide speculation that this gene was responsible for Lanza's criminal behavior. A genetic explanation, however, would not feature a single "massacre" gene. It would involve a complex profile—a constellation of mutated forms of genes that act in combination and, in certain environments, confer a high risk of violent action. The collection and analysis of DNA is an important tool in law enforcement, as shown in Fig. 12.2. According to the FBI, as of 2018, almost 400,000 cases have used DNA evidence to aid in criminal investigations. In addition, the Innocence Project (https://www.innocenceproject.org/) states that over 350 people, several of whom were on death row, have been exonerated as a result of DNA evidence. As of 2018, over 16 million people in the United States have their DNA profile in a criminal offender or arrestee database, more than two-thirds of whom were added in the past 10 years. Some expect this number will continue to climb as a result of the 2013 Supreme Court decision in Maryland v. King, in which the Court ruled to allow law enforcement to collect DNA from people who are arrested, but not charged with or convicted of, a crime. As of 2018, all states and Puerto Rico collect DNA from individuals convicted of at least some felonies, and 40 states also collect DNA from those convicted of at least some misdemeanor charges. Thirty states and Puerto Rico collect DNA from individuals arrested for, but not convicted of, certain crimes. For arrestees who subsequently are not charged or are acquitted, their DNA data may be expunged automatically (in a few states, such as Illinois, Maryland, and Texas) or upon request by the individual.
Smart cities and eradication of cybercrime

Here is an interesting revelation: the first canonical rule states that malware runs on steroids while security runs on diesel. Cybercrime is an extension of human creativity. The criminal mind is designed with kinetic intelligence and dynamic creativity focused on achieving a task; this task must have a target, a start, and an end, and the crime must be asymmetrical and unique. In fact, cybercrime is the most influential driver in the creation of smart security. The brutal truth is that the continuum of crime will never be totally eradicated, because it is part of the human psychoinfrastructure. There is a strong biological basis for criminality. The CBS hit television show "Criminal Minds," in its first episode on September 22, 2005, talks about criminals: those who kill in the spur of the moment tend to have a poorly functioning prefrontal cortex, the area that regulates emotion and impulsive behavior. In contrast, serial killers and others who carefully plan their crimes exhibit good prefrontal functioning. Criminals carry criminal genes in their DNA! Crime's twin brother, Internet-centric cybercrime, has grown in importance along with the Internet, which has become central to every field: commerce, entertainment, the public and private sectors, and lately the Internet of Everything (IoE). There are 1.5 million cyberattacks annually, which translates into over 4000 cyberattacks every day, or 170 attacks every hour, as displayed in Fig. 12.3. It boils down to roughly three attacks every minute!
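The per-day, per-hour, and per-minute figures quoted above follow directly from the 1.5 million annual total:

```python
annual_attacks = 1_500_000

per_day = annual_attacks / 365   # ≈ 4110 attacks per day
per_hour = per_day / 24          # ≈ 171 attacks per hour
per_minute = per_hour / 60       # ≈ 2.85, i.e., roughly 3 attacks per minute

print(round(per_day), round(per_hour), round(per_minute, 2))  # 4110 171 2.85
```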
FIGURE 12.3 By 2050, the world’s urban population will have increased from 3.9 billion to 6.4 billion, resulting in 67% of the world’s population inhabiting urban areas. As the urban pilgrimage continues to accelerate, a new phenomenon will prevail, and cybercrime will continually accelerate at an exponential rate, using AI technology to penetrate smart cities and bring every system to a screeching halt.
Final thought

The focus in this chapter has been on the impact of saving crime-poverty data in DNA digital storage (DDS) for the purpose of conducting predictive analyses and running simulation models of how smart cities can eradicate poverty and, consequently, crime. How can we have a smart city with a smart economy, sustainability, and societal stability and prosperity? DNA profiling is the process of determining an individual's DNA characteristics, which are as unique as fingerprints. DNA analysis intended to identify a species, rather than an individual, is called DNA barcoding; see Fig. 12.4 for a visual illustration. Later, we are going to tie DDS to two crippling social disgraces: poverty and crime. In this chapter, we define poverty as the condition in which a person's material resources are not enough to meet his or her minimum needs, including social participation. Poverty definitions should be tied to material conditions and minimum standards. Poverty is centrally defined by a general lack of material resources, so it is not the same as social mobility, well-being, or inequality. We may also add that poverty is relative and dynamic rather than absolute and fixed. Crime technically means breaking the law as applied within a given jurisdiction at a particular time. Particular criminal laws have emerged over time and are contested when enforced against a prosecuted person. Although some laws, such as those against killing and the use of violence, seem universally agreed upon as based on society's moral consensus, other laws may codify the interests of the influential, powerful, and rich and therefore be biased against the interests and wishes of the poor, the powerless, and those without a voice, as the title of Reiman and Leighton's study attests: "The Rich Get Richer and The Poor Get Prison." For example, the prison population contains disproportionate numbers of people who have lived in the most socially deprived areas. This is true in every city and country in the world.
In particular, the countries that claim they are democratic—we can list at least 20 countries that have the word democratic in their middle name—actually prove to be diametrically the opposite.
FIGURE 12.4 In the design of smart cities, planners need to collect quality data from different sources, such as CODIS, and capture all the crime episodes from the city. DNA storage is used to hold analysis results. DNA data can also be used for AI applications, such as machine learning and neural networks. All this effort will contribute to building a city with smart living.
The different studies of poverty and crime use different definitions and measures of poverty, and the sources of crime data are problematic as to whether they are officially recorded by the police, courts, and other agencies or self-reported by offenders and victims (see Fig. 12.5). These definitional and methodological issues are addressed where necessary in the discussions. Yascha Mounk, a lecturer on government at Harvard University, eloquently observed in the Journal of Democracy (April 2018) that "The past few years have called these supposed certainties into doubt. Many developed democracies are experiencing the most rapid political change they have seen in decades: Citizens are becoming increasingly angry. Party systems that had long been stable are disintegrating. Most strikingly, in a wide swath of countries across Europe and North America, a new crop of populists has entered parliament or even ascended to executive power." Forbes magazine, in its June 19, 2017, issue, printed an interesting article about "the problems with smart cities." It stated at the beginning that "The 'smart city' sounds like a digital utopia, a place where data eliminates first-world hassles, dangers and injustices. But there are some problems with smart cities, and no one, to my knowledge at least, has pointed them out." The article covers technological issues: sensory overload, buying 3 billion batteries for the sensors, power shortage and capacity, too much orphan data, and redesign of the environment. All
FIGURE 12.5 We decided to refer to this diagram as the "Smart City Chuckholes." City planners must study them and assemble a solution for each one. Not only third-world nations suffer from them; modern (civilized) cities in the United States, Europe, and the Middle East have most of them as well. Los Angeles, Philadelphia, New York, Washington, and Paris are beautiful, with attractive architecture, but they are strong candidates for poverty, the mother of crime.
these items are fine and dandy, but there is a big myth about smart cities that we are going to address: social crimes (poverty, crime, and unemployment), as shown in Fig. 12.5. We are going to be very critical of smart cities; we did our due diligence and gathered significant data to support our arguments.
Social crimes (poverty, crime, and unemployment) will remain with us as long as we live. Smart city planners and administrators pretend that they can eliminate social crimes with technology, but so far that has not happened, and it will not happen. Even if we decide to start a smart city from scratch, there will be monumental challenges to address. What is the optimum number of citizens for a new smart city? What is the best way to design it: rectangular or circular? From where are we going to bring the new citizens? Are they wealthy and educated, and will they come with their families? So, let us decide to go with an existing city of 1 million people. We try to follow a detailed Gantt chart that describes the whole project as a staggered, piecemeal design. The housing construction phase comes first, with the assembly of prefab modules; it will take one painful year to build 1000 houses, with considerable logistical and construction traffic. The next phase is sensor installation, where every house needs a dozen sensors, followed by connecting the sensors to a central control center, which is an uphill challenge. The Gantt chart has at least 100 tasks that depend on one another. As we know from experience, complex systems have complex behaviors, and they develop goals of their own the instant they come into being. In sum, it is not easy to convert "normal" citizens into smart ones. The word "smart" has been beaten to death, and technology companies' main interest is profitability; they ignore the social changes needed to serve all the citizens of the city.
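The scheduling problem behind such a Gantt chart, where no task can start until its predecessors finish, reduces to a longest-path computation over the dependency graph. The task names and durations below are invented to mirror the phases just described, not taken from an actual project plan:

```python
from functools import lru_cache

# Each task: (duration in months, list of prerequisite tasks). Illustrative values.
TASKS = {
    "housing construction": (12, []),
    "sensor installation": (6, ["housing construction"]),
    "control center hookup": (4, ["sensor installation"]),
    "citizen onboarding": (3, ["control center hookup"]),
}

@lru_cache(maxsize=None)
def finish_time(task: str) -> int:
    """Earliest finish time of a task once all its prerequisites are done."""
    duration, prereqs = TASKS[task]
    return duration + max((finish_time(p) for p in prereqs), default=0)

print(finish_time("citizen onboarding"))  # 25 months for the whole chain
```

With 100 interdependent tasks rather than four, the same computation identifies the critical path, i.e., the chain of tasks that determines how long the whole conversion really takes.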
Some interesting numbers about DNA data storage

We believe DNA digital storage is the future and will play a vital role in all datacenters during the deployment of a futuristic city. The antivirus technology companies cannot afford to overhaul their products; they will come up with extensions to their current portfolios, but they do not have new disruptive products at a time when DNA data storage is taking the lead.
Some interesting numbers about datacenter power consumption

People often confuse power and energy. Power is the rate at which energy is delivered or consumed; energy is power multiplied by time. A large server drawing 850 W around the clock consumes about 0.85 kW × 24 h = 20.4 kWh of energy per day. If we take the 10 largest datacenters in the world, each with roughly 1 million servers, we can estimate:

Total energy consumption = 10 centers × 1,000,000 servers per center × 20.4 kWh per server per day × 364 days ≈ 7.43 × 10^10 kWh per year

This number is growing at roughly 40% per year. Storage faces the same predicament, and DNA will be the right solution at the right time. By 2025, the volume of data consumed will reach 163 × 10^21 bytes (163 zettabytes), according to IDC's report "The Expanding Digital Universe."
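The back-of-the-envelope total can be reproduced directly. Note that 20.4 kWh per day corresponds to a server drawing 850 W around the clock; the ten-center and one-million-server figures, and the 364-day year, are the chapter's own numbers:

```python
# One server drawing 850 W for 24 hours uses 0.85 kW * 24 h = 20.4 kWh of energy.
SERVER_KWH_PER_DAY = 0.850 * 24

centers = 10                    # ten largest datacenters
servers_per_center = 1_000_000  # ~1 million servers each
days_per_year = 364             # the chapter's figure

total_kwh_per_year = centers * servers_per_center * SERVER_KWH_PER_DAY * days_per_year
print(f"{total_kwh_per_year:.3e} kWh per year")  # 7.426e+10 kWh per year
```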
Justification for using DNA storage

Our smart city forecast has a horizon of 20 years. The technology terrain will be drastically different, but better. The storage technology by then will also be different. Smart cities that keep a mirror of
production data in DNA will be in better shape, and their information can be easily decoded into binary format to conduct predictive analyses and learn how the city is progressing toward eradicating social crimes.
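The "decoded into binary format" step can be sketched with the common two-bits-per-base mapping (00=A, 01=C, 10=G, 11=T). Real DNA storage codecs add error correction and avoid long homopolymer runs; this toy round-trip omits all of that:

```python
BASE_FOR = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR = {base: bits for bits, base in BASE_FOR.items()}

def encode(data: bytes) -> str:
    """Pack binary data into a DNA base string, 4 bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Recover the original bytes from a DNA base string."""
    bits = "".join(BITS_FOR[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"crime")
assert decode(strand) == b"crime"   # lossless round trip
print(strand[:8])                   # CGATCTAG
```

Because the mapping is lossless, any analytics pipeline that reads binary data today could, in principle, read the same records back out of a DNA archive.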
Appendices

Appendix 12.A Poverty: state of health

The question now is: if smart cities offer improved quality of life to their urban citizens, can they eradicate extreme poverty? "Poverty" here denotes a socioeconomically marginalized population. It creates a Himalayan challenge for smart city planners and developers. Ironically, cities are considered the growth engines of the economy, but they also attract poverty in large proportions. Consequently, posh urban sprawls exist amid impoverished habitats called slums, where poor inhabitants are condemned to live in subhuman conditions. The United Nations, in 1995, defined "extreme poverty" as a condition characterized by severe deprivation of basic human needs, including food, safe drinking water, sanitation facilities, health, shelter, education, and information; it depends not only on income but also on access to services. The World Bank, in its Global Monitoring Report 2015/2016, defines "extreme poverty" as living on less than $1.90 per day. Governments are racing to build smart cities out of old cities—a great endeavor worth pursuing. The reason is obvious: big cities are contaminated with big crimes, which are largely a consequence of extreme poverty. So, cities are fighting the crippling effects of poverty with remedies such as affordable, centrally located housing; youth and family services, including food, clothing, and mental health support; strong schools and education services, especially for struggling and nontraditional learners; job training focused on employability and family-wage jobs; access to quality early care and learning; and access to quality healthcare.
Appendix 12.B Glossary for social/hate crime (courtesy MERIT CyberSecurity Library)

Abuse of household member (criminal): A domestic situation where one or more of the parties displays some sort of injury. This is an automatic arrest situation by the PD.
Aggravated assault (criminal): An unlawful attack by one person upon another for the purpose of inflicting severe or aggravated bodily injury. This type of assault usually involves the use of a weapon or means likely to produce death or great bodily harm. It is not necessary that injury result from an aggravated assault when a gun, knife, or other weapon is used that could, or probably would, result in serious injury if the crime were successfully completed.
Arrests and referrals for disciplinary action: Any case involving illegal weapons possession or violation of drug and liquor laws (a referral or arrest must take place, a disciplinary action must be initiated, and a record kept of the action taken and disclosed by location).
Arson (criminal): Any willful or malicious burning or attempt to burn, with or without intent to defraud, a dwelling house, public building, motor vehicle or aircraft, or personal property of another. Only fires determined through investigation to have been willfully or maliciously set and which occur on the institution's "Clery geography" are to be reported as arson, including attempts to burn, incidents where an individual burns his or her own property, and incidents determined to meet the UCR definition
of arson regardless of property damage. Fires investigated and determined to be of unknown or suspicious origin are not classified as arson.
Assault (criminal): The intentional striking or hitting of another person with the intent to hurt or injure.
Attempted theft (criminal): Any attempt to take or possess the property of another without that person's consent.
Burglary (criminal): The unlawful entry of a structure/dwelling to commit a felony and/or theft therein (tents, storage sheds, trailers, motor homes, house trailers, or other mobile units used for recreational purposes are not considered structures). For reporting purposes, this definition includes unlawful entry with intent to commit a larceny or a felony; breaking and entering with intent to commit a larceny; housebreaking; safecracking; and all attempts to commit any of the aforementioned.
Car trouble assists (noncriminal): Assistance provided to start, open, or otherwise help a motorist, documented as a Misc. Assist case.
Criminal property damage (criminal): The intentional, malicious, and deliberate damage of another's property or things.
Disorderly (criminal): Behavior that is loud, disruptive, repetitive, and aggressive and that causes concern or alarm for another person.
Domestic (noncriminal): An argument between related parties where no intervention is required and no injury has occurred.
Drug abuse violations (criminal): Violations of state and local laws relating to the unlawful possession, sale, use, growing, manufacturing, and making of narcotic drugs. The relevant substances include opium or cocaine and their derivatives (morphine, heroin, codeine); marijuana; synthetic narcotics (Demerol, methadone); and dangerous nonnarcotic drugs (barbiturates, Benzedrine).
Drug paraphernalia (criminal): Possession of these items is treated as contraband and the case referred to the PD. Nonpossession (noncriminal) is classified as Misc. Public, with items turned over to the PD for disposal; if the PD does not accept them, destruction is required and must be noted.
Fire (noncriminal): Any incident that results in damage to a building, land, property, or other material items and is not arson related.
Fire alarm (noncriminal): The activation of any smoke or sprinkler device where notification has been made to the fire alarm panel.
Forcible fondling (criminal): The touching of the private body parts of another person for the purpose of sexual gratification, forcibly and/or against that person's will, or not forcibly or against the person's will where the victim is incapable of giving consent because of his/her youth or his/her temporary or permanent mental incapacity.
Forcible rape (criminal): The carnal knowledge of a person, forcibly and/or against the person's will, or not forcibly or against the person's will where the victim is incapable of giving consent because of his/her temporary or permanent mental or physical incapacity (or because of his/her youth).
Forcible sodomy (criminal): Oral or anal sexual intercourse with another person, forcibly and/or against that person's will, or not forcibly or against the person's will where the victim is incapable of giving consent because of his/her youth or his/her temporary or permanent mental or physical incapacity.
Forgery/counterfeiting (criminal): The unauthorized use of someone's signature to receive money or goods; the duplication of US currency or another negotiable item used to obtain goods or services.
Chapter 12 DNA Data and Social Crime
Found property (noncriminal): Items that are located and found by or turned in to security for safekeeping.

Harassment (criminal): Behavior that involves an intent to annoy, harass, or alarm another in person, by telephone, computer, or other communication device, or by touching another person in an offensive manner.

Hate crimes (criminal and noncriminal): Any behavior motivated by hate based on race, gender, religion, sexual orientation, ethnicity/national origin, and/or disability. Depending on the nature of the actions, the case can be either criminal or noncriminal but, in each case, must be reported as a Clery Category Offense. If a hate crime occurs where there is an incident involving intimidation, vandalism, larceny, simple assault, or other bodily injury, the law requires that the statistic be reported as a hate crime even though there is no requirement to report the crime classification in any other area of the compliance document (a “bias-related” (hate) crime is not a separate, distinct crime, but the commission of a criminal offense that was motivated by the offender’s bias). For example, a subject assaults a victim, which is a crime; if the facts of the case indicate that the offender was motivated to commit the offense because of his bias against the victim’s race, sexual orientation, etc., the assault is then also classified as a hate crime. The type of crime shall be classified from the list below, as well as the type of bias noted:
• Race, religion, ethnicity/national origin, gender, sexual orientation, disability
• Murder/nonnegligent manslaughter
• Negligent manslaughter
• Sex offenses (forcible and nonforcible)
• Robbery
• Aggravated assault
• Burglary
• Motor vehicle theft
• Arson
• Liquor law violations
• Drug abuse violations
• Weapons law violations
• Larceny/theft
• Destruction/damage/vandalism of property
• Intimidation
• Simple assault
Incest (criminal): Nonforcible sexual intercourse between persons who are related to each other within the degrees wherein marriage is prohibited by law.

Indecent exposure (criminal): The intentional and wanton display of one’s genitals to another person or persons.

Intimidation (criminal/hate crime): To unlawfully place another person in reasonable fear of bodily harm through the use of threatening words and/or other conduct, but without displaying a weapon or subjecting the victim to actual physical attack.

Intoxicated person (noncriminal): If the behavior has not resulted in the commission of some other criminal offense, this is classified as a Misc. Public Case.
Larceny (criminal/hate crime): The unlawful taking, carrying, leading, or riding away of property from the possession or constructive possession of another.

Liquor law violations (criminal): The violation of laws or ordinances prohibiting the manufacture, sale, transporting, furnishing, or possessing of intoxicating liquor; maintaining unlawful drinking places; bootlegging; operating a still; furnishing liquor to a minor or intemperate person; using a vehicle for illegal transportation of liquor; drinking on a train or public conveyance; and all attempts to commit any of the aforementioned (drunkenness and driving under the influence are not included in this definition).

Lost property (noncriminal): Undetermined location where the item was lost or misplaced; the case will be classified as a Misc. Public Case.

Medical cases (injured and sick) (noncriminal): Cases where a person is either sick and/or injured and where security and medical assistance have been requested. These will also be classified as Misc. Asst. Cases.

Miscellaneous public (noncriminal): Cases that are basically noncriminal in nature where security services or assistance was provided. A subheading of the case or type of assistance provided shall be included.

Missing students (noncriminal): Cases where students are reported missing for an extended period of time (usually more than 24 h) and either security or the PD has been notified. This is also a Misc. Public Case.

Motor vehicle accident (noncriminal): Incidents involving motor vehicles or other motorized devices where damages or injuries are caused to the vehicles or other property.

Motor vehicle theft (criminal): The theft or attempted theft of a motor vehicle (this includes any self-propelled vehicle, i.e., automobiles, sport utility vehicles, trucks, buses, motorcycles, motor scooters, mopeds, all-terrain vehicles, and snowmobiles).

Murder/nonnegligent manslaughter (criminal): The willful (nonnegligent) killing of one human being by another (does not include suicides, fetal deaths, traffic fatalities, accidental deaths, assaults to murder and attempts to murder, situations in which a victim dies of a heart attack as a result of a crime, or justifiable homicide [defined as and limited to the killing of a felon by a peace officer in the line of duty or the killing of a felon by a private citizen during the commission of a felony]).

Negligent manslaughter (criminal): The killing of another person through gross negligence (the intentional failure to perform a manifest duty in reckless disregard of the consequences as affecting the life or property of another). Does not include deaths of persons due to their own negligence, accidental deaths not resulting from gross negligence, or traffic fatalities.

Property damage (noncriminal): Any damage caused to property or things of an unintentional or nonmalicious nature. Acts of God are included, and the case is reported as a Misc. Public Case.

Rape (criminal): See sex offenses, forcible.

Robbery (criminal): The taking or attempting to take anything of value from the care, custody, or control of a person or persons by force or threat of force or violence and/or by putting the victim in fear.

Sex assault, forcible (criminal): Any sexual act directed against another person, forcibly and/or against that person’s will, or against someone incapable of giving consent.

Sex offenses, nonforcible (criminal): Unlawful, nonforcible sexual intercourse.

Sexual assault with an object (criminal): The use of an object or instrument to unlawfully penetrate, however slightly, the genital or anal opening of the body of another person, forcibly and/or against that person’s will, or not forcibly or against the person’s will where the victim is incapable of
giving consent because of his/her youth or because of his/her temporary or permanent mental or physical incapacity.

Simple assault (criminal/hate crime): An unlawful physical attack by one person upon another where neither the offender displays a weapon nor the victim suffers obvious severe or aggravated bodily injury involving apparent broken bones, loss of teeth, possible internal injury, severe laceration, or loss of consciousness.

Statutory rape (criminal): Nonforcible sexual intercourse with a person who is under the statutory age of consent.

Student conduct code (noncriminal): Any student behavior or activity that is expressly prohibited by HawCC. Although noncriminal, the activity will be documented as a Misc. Pub. Case and reported in statistics if disciplinary action is taken.

Student housing violation (noncriminal): Any student behavior or activity that is contrary to UHH student housing policies or procedures. These cases will be reported as Misc. Pub. Cases (applicable only to the University of Hawaii at Hilo; HawCC has no student housing).

Theft (criminal): The taking and/or use of another’s property without that person’s permission or consent.

Trespass (criminal): The illegal occupying of or presence in or on the premises of a location from which a person or individual has been specifically warned to stay away and where an arrest is made.

Trespass warning (noncriminal): The initial phase of the trespass process, where the person is put on notice that their behavior has necessitated that they no longer occupy or be present at a specific location or locations for a period of 1 year.

Vandalism (criminal/hate crime): The willful or malicious destruction, injury, disfigurement, or defacing of any public or private property, real or personal, without the consent of the owner or person having custody or control, by cutting, tearing, breaking, marking, painting, drawing, covering with filth, or any other such means as may be specified by local law.

Weapon law violations (criminal): The violation of laws or ordinances dealing with weapon offenses, regulatory in nature, such as the manufacture, sale, or possession of deadly weapons; carrying deadly weapons, concealed or openly; furnishing deadly weapons to minors; aliens possessing deadly weapons; and all attempts to commit any of the aforementioned.
Appendix 12.C Glossary (Courtesy MERIT CyberSecurity Library)

ACORN: A commercial product that splits people and families into different groups who share common characteristics based on where they live. Each of these groups has a profile which describes what they are like in terms of a wide range of behaviors and circumstances. Produced by CACI, with a European-wide version and one for each country in the North Sea region based on local data.

Business process reengineering: Basically, changing the way you do things and deliver services. This generally involves looking in detail at the internal processes you use (who collects information, where is it sent, who makes decisions, how are they actioned) to see if some of them can be speeded up or removed. It often involves comparing how similar processes are delivered and looking at e-government tools to see if they can deliver improvements such as electronic forms.

Channel shift: Persuading customers to use cheaper (generally online) ways of getting services or information instead of more expensive ones. This is particularly the case where services can be accessed on a self-service basis via things like electronic forms or directories.
Channel strategy: Describing the best way for customers to contact the local authority in terms of customer satisfaction and cost effectiveness, and then producing an action plan to deliver that. The action plan can include making sure the services are available in a way people can easily use and marketing the new ways we want them to access them.

Codesign: Working together with customers to decide how services will be delivered. This generally means more than simply carrying out a consultation process asking for views, and involves talking to customers face to face. In the best cases, it will include in-depth customer testing and even asking customers to say how services should work before starting to design them.

Community profile: A pen picture of what we know about a particular area. This should include statistical information about the people and businesses based there and issues like crime, health, and education, but can also look at the physical environment (open space, roads, etc.), hard and soft resources such as schools, health services, clubs, and societies, and the range and level of services being delivered in that area.

Control: In testing new resources or approaches, a control is an area that has similar characteristics to the place where the test is taking place but where you do nothing new. This means you can be reasonably sure any changes in the test area are the result of what you are doing rather than general outside changes.

CRM (customer relationship management): The computer database on which you record information about what services people have requested and been given.

Customer journey: The process people go through to access a service, from their point of view rather than that of the organization: who they have to talk to (and how many), how long they have to wait, what information they have to give, and how they are treated and feel at each stage. The output looks like a standard process map but includes an assessment of the customer’s level of satisfaction at each stage.

Customer profile: A pen portrait of a group of people or families who share similar circumstances. These circumstances include the sort of homes they live in, income and employment, skills, the size and age of the family, and their habits in things such as health, entertainment, and shopping. It will also include which services they regularly use. These are produced by matching statistical information (80% of people in this sort of job live in this sort of house) and building on this by attaching survey and other information where you look at the views or behavior of each profile.

Data matching: Taking two or more datasets with a common element, such as address, and adding them together so you can see where there are statistical matches (e.g., 60% of people who borrow library books also go to museums).

Data protection: Making sure personal information gathered by public or private organizations in the delivery of a service is not used for other purposes or shared without the permission of the individual involved.

e-Government: Using information and communications technologies (especially the Internet) to deliver services that in the past have been delivered by talking to a person face to face, on the phone, or on paper. These services can be internal processes as well as customer services and will include using Internet- or computer-based information to support person-to-person service delivery.

Focus groups: A representative group of customers who are brought together to talk through issues based on their knowledge and experience. They can be representative either because their characteristics match those of the general public or because they have needs or interests or belong to a
particular group. Generally, focus groups will have specific questions you want them to answer, but the discussion is more open ended than you would get with a questionnaire.

Geospatial data: Data with an address attached. This can be a postal address, an area such as a local government electoral division, or a map reference.

GIS (geographic information systems): Computer systems that present information in map form rather than as words, figures, or other sorts of pictures.

ICT (information and communications technologies): Not just computers but also telephones and any technology that stores, manipulates, communicates, or displays information electronically.

Indicators: Data that help us understand what is happening in the real world. These can relate to process (what we do, such as the number of forms we process), outputs (what we produce, such as the value of welfare benefits we pay), or outcomes (what difference we make, such as a cut in the number of people below the poverty line). They can be indicators of actual change (the number of qualifications children get) or proxy indicators (the number of potholes we fill in as an indication of how good the roads are). In some countries, such as England, central government sets out national indicators that other public sector organizations have to report on regularly and which are used to assess how well they are serving their local citizens and businesses.

Intermediaries: People who help customers get access to our services but whom we do not directly employ. They can be friends and family of the customer, voluntary organizations helping particular groups, or other public sector organizations who offer one-stop services or can signpost what is available from other agencies.

Knowledge management: A structured approach to making sure that what people in the organization know about how services are delivered and customers are served is available to everyone in the organization who needs it, when they need it. This is commonly done using networked computers, including intranets or the Internet.

Marketing: The process of making sure your products or services achieve their full potential in terms of sales or take-up. The full marketing process includes understanding customers and rivals in the marketplace, product design, and pricing, as well as marketing communications (marcomms) such as advertising, branding, and publicity.

MOSAIC: A commercial product that splits people and families into different groups who share common characteristics based on where they live. Each of these groups has a profile that describes what they are like in terms of a wide range of behaviors and circumstances. Produced by Experian, with a European-wide version and one for each country in the North Sea region based on local data.

National indicators: In England, central government sets out national indicators that other public sector organizations have to report on regularly and which are used to assess how well they are serving their local citizens and businesses.

OAC: An open-source product that splits people and families into different groups who share common characteristics based on where they live. Each of these groups has a profile that describes what they are like in terms of a range of behaviors and circumstances. Produced by the Office for National Statistics in the United Kingdom.

Outcomes: What difference we make, such as a cut in the number of people below the poverty line or a reduction in CO2 emissions.

Outputs: What we produce, such as the value of welfare benefits we pay or the number of people we train.

Personas: The description of a typical member of a group, such as a customer profile or priority group, as if they were a real, named individual.
PRINCE: A UK-based, very structured project management methodology that is widely used by the public sector. The full PRINCE methodology is thorough and can be overly complex for small projects, so many councils in England use “cut down” versions that are more appropriate.

Process mapping: A visual way of representing the process of delivering services to internal and external customers. This will typically look at each stage to understand which agencies and officers are involved, what the information flows are, where decisions are made, and any points where eligibility criteria are used. It is particularly useful where there are several ways in which the process can be started. Having mapped out the process, it is then easier to compare it with similar processes to see if there is room for improvement. For example, is the process for applying for a concessionary bus pass the same for a student as for a pensioner or someone on state benefits?

Qualitative analysis: Looking at information that has been gathered as a result of open-ended questions. There is often a quantitative element to this, through either comparing the number of people who gave similar answers or raised similar issues, or through taking the answers from a small representative group and multiplying them up by the number of people in that group across your area. Although the questions are open ended, it is still important to use the same set of questions with everyone so that comparing answers is easier.

Quantitative analysis: Looking at information that has been gathered as a result of questions that have predetermined answers for people to choose from. Sample sizes are generally bigger than for qualitative work, although they are generally still representative of the population as a whole rather than the whole population, and answers are then multiplied up to estimate the view of the whole population. There is still a subjective element to take account of, as the answers given depend on the way the question is asked. Note that it is generally advisable to use both qualitative and quantitative techniques: you can use qualitative work to test the questions and answers before going to a bigger sample group, or you can follow up the answers to quantitative work with an in-depth discussion with a smaller group to get more detail.

Social marketing: Using marketing techniques to change people’s behavior rather than simply increasing take-up of a product or service. Examples would be encouraging people to take more exercise, eat more healthily, or recycle more of their rubbish. Encouraging channel shift is a social marketing exercise.

Super output area: The geographical area at which information from the national census is published in Britain. Generally speaking, this is the smallest area at which statistics are published, and SOAs are used to build up larger areas like local government electoral boundaries or areas such as postcodes.

Systems integration: Using software and data standards to link together or share information held on different computer systems. It is more and more common to do this using Internet-based standards such as XML schemas.

Unit costs: The average cost of delivering a particular transaction or service. For example, customers may have to fill in a form to request a service. The unit cost of this will vary between face-to-face, telephone, and online methods because the cost of staff supporting the transaction and the building costs of face-to-face services differ for each channel. The costs included, which are then divided by the volume of transactions, can vary, but direct customer access staff, buildings, and ICT costs should always be included.

User needs: Understanding what customers need, not just in terms of the services they get but the way they get access to those services. This should always be assessed by actually asking customers.
Suggested readings

https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-682X.00067.
The Hartford Courant, January 11, 2013; https://www.courant.com/.
https://www.livescience.com/25853-newtown-shooter-dna.html.
Could genetics help us understand criminals? https://www.courant.com/opinion/hc-op-comfort-geneticunderpinnings-of-newtown-sho-20130111-story.html.
DNA databases: http://www.cfbdadosadn.pt/en/conexoes/adndireitos/Pages/default.aspx.
DNA in analysis of social crime: https://www.researchgate.net/publication/291385761_Approaching_ethical_legal_and_social_issues_of_emerging_forensic_DNA_phenotyping.
Expanding the digital universe: https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf.
Extreme poverty: http://pubdocs.worldbank.org/en/503001444058224597/Global-Monitoring-Report-2015.pdf.
Glossary for Social/Hate Crime (Courtesy MERIT CyberSecurity Library): www.termanini.com.
John, H., 2009. A New Breed: Satellite Terrorism. Strategic Book Publishing.
Kirby, L.T., 1992. DNA Fingerprinting. Oxford University Press.
Yar, M., 2013. Cybercrime and Society, second ed. Sage Publications.
Problems in smart cities: https://www.forbes.com/sites/forbestechcouncil/2017/06/19/the-problems-with-smart-cities/#3c5dbf3b6067.
Reiman and Leighton: http://sk.sagepub.com/reference/encyclopedia-of-criminal-justice-ethics/n291.xml.
Mounk: https://www.journalofdemocracy.org/authoreditor/yascha-mounk.
Robert K. Merton: https://www.thoughtco.com/robert-merton-3026497.
CHAPTER 13
DNA data and cybercrime and cyberterrorism
We all remember General George Patton’s famous quote: “May God have mercy upon my enemies, because I won’t.” (George Patton)

We have to keep building our security walls higher and higher, because these cyber criminals are building longer and longer ladders. (Dame Dido Harding)

Progress imposes not only new possibilities for the future but new restrictions. (Norbert Wiener)

The best way to predict the future is to create it. I have been saying for many years that we are using the word “guru” only because “charlatan” is too long to fit into a headline. (Peter F. Drucker, on predicting the future)
Opening thoughts
There is a widespread fallacy among geneticists, genetics professionals, biomedical researchers, and the care-giving community in general that DNA (deoxyribonucleic acid), whose structure was identified by Francis Crick and James Watson at the Cavendish Laboratory of the University of Cambridge in 1953, is chiefly a forensic tool: a molecular record matched between victim and suspect to help solve a crime. DNA is a formidable discovery that tells us where we have been, for example by establishing proof of paternity. While DNA contains material common to all humans, some portions are unique to each individual; DNA profiling is a systematic method of establishing the identity of a person, very much like fingerprinting. Nor can we ignore the remarkable benefits of DNA in medicine. To sum it up: if someone has a chronic disease such as muscular dystrophy, his or her DNA carries a mutation (a defect) that affects the growth of that person’s muscles. Today, genetic surgeons can cut out the mutated DNA segment (with CRISPR) and “sew” the DNA back to normal, and the patient may regain his or her muscle function and walk again one day. DNA shines in another innovative domain as well: DNA data storage (DDS).
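To make the idea of DNA data storage concrete, the sketch below shows the simplest possible encoding: two bits per nucleotide. This is a hypothetical illustration only; real DDS codecs add error-correcting codes and avoid long homopolymer runs, which this toy mapping does not.

```python
# Toy sketch of DNA data storage: map each pair of bits to one base.
# Assumed mapping (illustrative, not any specific published codec):
#   00 -> A, 01 -> C, 10 -> G, 11 -> T
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a DNA strand, 4 bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert the mapping: 8 bits (4 bases) back into each byte."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(encode(b"Hi"))  # -> CAGACGGC
```

At this density, every byte costs four bases, i.e., 2 bits of payload per nucleotide before any redundancy is added.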
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00013-6 Copyright © 2020 Elsevier Inc. All rights reserved.
DNA is our binary Holy Grail of data storage
DNA, when stored in cool, dry conditions, is remarkably stable. The successful sequencing of 700,000-year-old horse DNA, recovered from the Arctic permafrost, is a testament to its longevity. We are riding the Internet snowball, and we are going downhill. Technology companies are producing new gadgets every year (yes, every year), pulling the carpet from underneath us; we cannot buy new products every other day. With the Internet, things have gotten worse. According to www.internetworldstats.com, the world will have 4,208,571,287 internauts in 2020, more than half of the world’s population, making up the phenomenal cybercosmos. Fig. 13.1 shows the strong correlation between the number of Internet-connected users and the number of connected devices. Last year, Cybersecurity Ventures, on its website https://cybersecurityventures.com, predicted the following eye-popping data:
• Cybercrime will cost the world $6 trillion annually by 2021.
• Cybercrime cost $3 trillion in 2015.
FIGURE 13.1 This chart, provided by the Cisco Global Cloud Index, shows the relationship between Internet users and the devices they use. The scale on the right represents the users; the scale on the left represents the devices on the Internet. In 2024, for example, the number of users is about 4.9 billion, while the number of devices used that year is 62.12 billion, a ratio of roughly 12 devices per user. In the graph, the devices are connected either to a datacenter or to the cloud.
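The devices-per-user ratio quoted in the caption can be checked directly from the two projections it gives (a trivial sketch, using only the 2024 figures stated in the caption):

```python
# Sanity check on the Fig. 13.1 caption: devices per Internet user in 2024.
users_billion = 4.9      # projected Internet users, 2024 (from the caption)
devices_billion = 62.12  # projected connected devices, 2024 (from the caption)

ratio = devices_billion / users_billion
print(f"{ratio:.1f} devices per user")  # ~12.7, consistent with "roughly 12"
```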
• Global spending on cybersecurity products and services will exceed $1 trillion over the 5-year period from 2017 to 2021.
• There were 3.8 billion Internet users in 2017 (51% of the world’s population of 7 billion).
• There will be 6 billion Internet users by 2022, which is 75% of the projected world population of 8 billion.
• By 2030, there will be 7.5 billion Internet users, which is 90% of the projected world population of 8.5 billion aged 6 years and older.
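The cost projection in these figures, from $3 trillion in 2015 to $6 trillion annually by 2021, implies a compound annual growth rate that is easy to derive (a sketch of the arithmetic; the endpoint figures are Cybersecurity Ventures' projections, not measurements):

```python
# Implied compound annual growth rate (CAGR) of global cybercrime cost:
# $3 trillion (2015) doubling to $6 trillion (2021).
cost_2015, cost_2021 = 3.0, 6.0   # trillions of US dollars
years = 2021 - 2015               # 6-year span

cagr = (cost_2021 / cost_2015) ** (1 / years) - 1
print(f"Implied growth: {cagr:.1%} per year")  # doubling in 6 years ~ 12.2%/yr
```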
Let’s forget about cybercrime for a moment and focus on the data storage crisis that is going to impact the whole world. Here’s my assessment: we have 4.3 billion Internet users (online and offline). All these users require hardware, software (including storage), and network connectivity. These information workers are generating an astronomical amount of information.
Behavior of the cybercriminal The following is an excerpt from https://www.researchgate.net/publication/228684366_Examination_of_Cyber-criminal_Behaviour: Criminal behavior is defined as an act (or failure to act) in a way that violates public law. The criminal mind is modulated by society through family and community pressure, education, and law enforcement. The intent behind criminal behavior may be subdivided by a distinction between direct or oblique, with the former referring to premeditated or deliberate cybercrime (such as compromising a critical system) and the latter indicating accidental or indirect crime (such as stealing DNA medical records for sale).
Adding insult to injury We recognize with lament two facts about US cybersecurity. First, we still have cyberterrorism because we are not capable of defeating it or even controlling it. Second, the United States invented the Internet and gave it to the world on a silver platter; now it is playing second fiddle, unable to regain the leadership it deserves. Take, for example, the arrogant countries of Russia, China, Iran, and North Korea, which routinely launch cyberattacks on peaceful cities, rob technology data from private companies, spy on military sites, use online tools to spread twisted propaganda, and hire unethical computer technicians (mercenaries) to do the dirty work. There are over 1500 training centers and academies teaching radicals how to become cyber fighters and militants. It is like training dogs to be killing machines; students are mind-programmed to use the most lethal tools of the Devil. Techno-espionage is ramping up, bundled with superior innovation and bad blood. Silicon Valley, located in the South Bay of the San Francisco Bay Area, has been suffering from a shortage of computer specialists. Every month, dozens of new tech companies are launched, all looking for top-notch computer specialists; according to one August report, 16,000 open technical positions were recorded. These high-tech companies either lure specialists away from domestic or international competitors or outsource projects. This opens the front gate to major potential espionage and theft of intellectual property. It is a horrific technical bleeding that has no effective solution.
Anatomy of cyberterrorism Nearly 75% of the world’s population practices one of the five most influential religions of the world: Buddhism, Christianity, Hinduism, Islam, and Judaism. The Internet is the keystone and the sole conduit of cyberterrorism. Here is the causal relationship of hostility that drives cyberterrorism, including all the factors that lead to the crime:

Hostility = religious fanaticism + political conflict + social revenge + territorial gain + radical activism + intimidation + technological rivalry

These are some of the major influencing factors that lead to cyberterrorism. Obviously, cyberterrorists who belong to radical organizations will not be intimidated into revealing their slick terror acts. Often, government-sponsored acts of cyberterrorism are denied or politely kept silent. It is said that tomorrow’s terrorists may be able to do more damage with a mobile phone than with a bomb.
Cybercrime runs on steroids; antivirus technology runs on diesel

Where did the word cyber come from? The success of an industry is reliant on its economics: the production, allocation, and use of its goods and services. Cybercrime, like any other industry or business, maintains its own economy of commoditized products and services. Dr. Norbert Wiener, while teaching mathematics at MIT, reportedly advised: "use the word cybernetics, because nobody knows what it means. This will always put you at an advantage in arguments." So, he adopted the term as the title for his book Cybernetics. Wiener borrowed the ancient Greek root "cyber," which is related to the idea of government or governing. Indeed, the only time the word "cybernetics" had appeared before was in a few works of political theory about the science of governance. The faddish word "cyber" has since become a main ingredient of rocket fuel. Dr. Wiener then expanded the realm of the word, and it became a fancy term in the cyber world. Fig. 13.2 adds a little more information on the term cybernetics. Cybercrime is like tropical vegetation in a rainforest: an ecological wonder with great impact and no control. Cybercrime is another technological phenomenon that is growing at an exponential rate, as eloquently expressed by Ray Kurzweil: "An analysis of the history of technology shows that technological change is exponential, contrary to the common-sense 'intuitive linear' view. So, we will not experience 100 years of progress in the 21st century; it will be more like 20,000 years of progress (at today's rate). The 'returns,' such as chip speed and cost-effectiveness, also increase exponentially; there is even exponential growth in the rate of exponential growth." Fig. 13.3 is like a live movie showing how cybercrime has started to embrace our societal fabric and impede our prosperity and our living.
Bill Gates, founder of Microsoft, once said, "If GM had kept up with technology like the computer industry has, we would all be driving $25 cars that got 1000 MPG." This is not funny. Smart cities are a great innovative paradigm drenched in technology; they have become the haven of professional crime syndicates promoting their cliché of crime-as-a-service. Fig. 13.4 is a great testimony to how urbanization and cybercrime became strongly correlated.
FIGURE 13.2 In his writing, Wiener described what was, at the time, a futuristic idea: that one day there would be a computer system that ran on feedback. Essentially, it would be a self-governing system. And for a long time, cybernetics remained the purview of information theorists like Wiener and early computer programmers.
FIGURE 13.3 Cybercrime's progressive evolution and sophistication (as shown in the dotted line). We are in the second generation of malware and are still struggling to overcome the malware that has plagued the world. Obviously, malware is a trillion-dollar business managed by an underground network of politicians and computer engineers (state and nonstate groups) to disrupt any successful entity and sneak into the healthy fabric of our society. We may be chronically stuck in the second generation of malware, but we are already suffering from third-generation attacks.
FIGURE 13.4 This chart is a real testimony to how the cyber world has been slowly undermining, like termites, all smart cities in the United States and the world. Data was acquired from the US Census Bureau, International Database.
Cybercrime data repositories

In 2015, the United Nations Office on Drugs and Crime (UNODC), under the framework of the Commission on Crime Prevention and Criminal Justice (CCPCJ), launched the cybercrime repository, a central database of legislation, case law, and lessons learned on cybercrime and electronic evidence. The website is very rich, with an impressive sitemap covering every topic of the cybercrime continuum: https://www.unodc.org/unodc/en/cybercrime/cybercrime-repository.html. The cybercrime repository aims to assist countries in their efforts to prevent and effectively prosecute cybercriminals. Interpol is committed to the global fight against cybercrime, as well as to tackling cyber-enabled crimes. Interpol's main initiatives in cybercrime focus on the following:

• Operational and investigative support
• Cyber intelligence and analysis
• Digital forensics
• Innovation and research
• Capacity building
• National cyber reviews
The UN Cybercrime Repository is a massive index of cybercriminal case law and lessons learned, used to train law enforcement officers, prosecutors, and judges. It was developed to promote technical assistance and strengthen international cooperation in the fight against cybercrime. The repository is the only tool in the world that archives cybercrime laws, cases, and key takeaways in a searchable database. The rapidly growing index cross-references global cybercrime incidents by topic, including global cyber investigations, requests for ISP-stored traffic data, and incidents of real-time traffic collection. The Background Investigation Bureau's (BIB) comprehensive proprietary Criminal Record Database search expands the scope of the baseline search methodology by thoroughly examining over 450 million unique records that it regularly collects from over 3200 public agencies. Its massive database, coupled with its proprietary identity-mapping technologies, has proven to consistently produce valuable information that may otherwise be overlooked. The FBI's Next Generation Identification (NGI) is a face recognition database of nearly 30 million civil and criminal mug shot photos. It also has access to the State Department's visa and passport databases, the Defense Department's biometric database, and the driver's license databases of at least 16 states. Totaling 411.9 million images, this is an unprecedented number of photographs, most of which are of Americans and foreigners who have committed no crimes. Germany set up its national DNA database for the German Federal Police (BKA) in 1998. In late 2010, the database contained DNA profiles of over 700,000 individuals, and in September 2016 it contained 1,162,304 entries.
The Library of Congress is worth mentioning, as it has more than 38 million books and other printed materials, 3.6 million recordings, 14 million photographs, 5.5 million maps, 8.1 million pieces of sheet music, and 70 million manuscripts. The library has around 67,000,000 total items made of paper or magnetic storage. The Combined DNA Index System (CODIS), the DNA database maintained by the Federal Bureau of Investigation (FBI), contains case samples (DNA samples from crime scenes or "rape kits") and individuals' samples (collected from convicted felons or arrestees) that are compared automatically by the system's software as new samples are entered. As for the DNA Atlas, the name of the Human Cell Atlas (HCA) itself should be helpful: the term "atlas" creates an instant image of the scale and objectives of the project, in the way that a reader can look in detail beyond continents and countries into individual regions and towns. The HCA will enable scientists to visualize and characterize individual cell types within different organs and tissues of the body. A portal to HCA data for the general public that uses a "Google Maps" approach to free exploration would help forge this connection. From a law enforcement perspective, the HCA will offer deep anatomical knowledge about criminals at large who may not be aware that their DNA will lead them to justice.
The storage supply is killing the storage demand

The August 2018 issue of IEEE Spectrum magazine carried an article titled "Why the Future of Data Storage Is (Still) Magnetic Tape," which states that "Indeed, much of the world's data is still kept on tape, including data for basic science, such as particle physics and radio astronomy, human heritage and national archives, major motion pictures, banking, insurance, oil exploration, and more." But no matter how much tape or magnetic storage companies push the envelope, scalability is critical as the media get smaller, and, eventually, these companies must start looking for a brighter storage future. If someone from the future, two decades or two centuries from now, traveled back in time to
FIGURE 13.5 This composite graph is an eye-opener. It is quite apparent that all technology companies, storage and antivirus alike, must be aware of our expanding digital universe. The information consumption curve runs away with a steeper trajectory, while the storage supply curve falls far behind it. DNA storage will prevail at the right time to save the world from the data deluge.
today, they would probably chuckle at our use of hard drives and USB sticks, the way we now wonder how we ever survived with floppy disks and Zip drives. Fig. 13.5 represents a peek at the data-consumed-versus-available-magnetic-storage situation of the current decade. What kind of storage devices will we be using in the future? DNA digital storage is what the future of data storage technology might look like.
DNA is the holy grail of digital storage

Perhaps the most viable futuristic data storage technology is DNA. Yes, the molecules that store biological information could be used to store other kinds of data. In 2012, Harvard researchers led by Dr. George Church were able to encode DNA with digital information, including a 53,400-word book
in HTML, 11 JPEG images, and 1 JavaScript program: https://www.academia.edu/33418416/Hard_Disk_Drives. DNA offers incredible storage density, 2.2 petabytes (10¹⁵ bytes) per gram, which means that a teaspoon-sized DNA store could fit all of the world's data: every song ever composed, every book ever written, every video ever shared. Besides the space savings, DNA is ideal for long-term storage. While we are lucky if our hard drive lasts 4 years, and optical disks are susceptible to heat and humidity, Harvard researcher George Church was quoted as saying, "You can drop DNA wherever you want, in the desert or your backyard, and it will be there 400,000 years later." As you might imagine, DNA takes a long time to read and write, and the technology is still too expensive to be usable now. According to New Scientist, in one recent study the cost to encode 83 kilobytes was £1000 (about US$1500). Still, scientists are encoding information into artificial DNA and adding it to bacteria. It's like a sci-fi novel that is currently being written and lived. DNA could be the ultimate eternal drive one day. Inventors and researchers continue to push the envelope when it comes to capacity, performance, and the physical size of our storage media. Today, the data storage provider Backblaze stores 150 petabytes of customer data in its data centers, but in the future, such providers will likely be able to store an almost incomprehensible amount of data: zettabytes (10²¹ bytes), if not domegemegrottebytes (10³³ bytes). (Nice names, right? A petabyte is equivalent to one million gigabytes; a zettabyte equals one million petabytes; and a domegemegrottebyte equals a trillion zettabytes.) With the human race creating and saving an exponential amount of data, this is a great thing, and the future of data storage is exciting.
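The basic encoding idea can be sketched in code. The mapping below is a minimal illustration, assuming a naive 2-bits-per-base scheme of our own choosing; real DNA codecs, such as those used by the Church and Goldman teams, add addressing, error correction, and constraints against long homopolymer runs, none of which appear here:

```python
# Minimal sketch: map binary data to DNA bases at 2 bits per base.
# Real DNA codecs add addressing, error correction, and avoid long
# homopolymer runs (e.g., AAAA), so this is illustrative only.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn each byte into 8 bits, then each 2-bit pair into one base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Reverse the mapping: each base back to 2 bits, each 8 bits to a byte."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"DNA")
print(strand)          # CACACATGCAAC (4 bases per byte)
print(decode(strand))  # b'DNA'
```

At 2 bits per base, one byte costs four bases; the 2.2 petabytes-per-gram figure quoted above comes from the physical density of nucleotides, not from any cleverness in the mapping itself.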
Back to smart city

Here is an interesting revelation: the human body is actually a smart system. In fact, the body resembles the most miraculous Internet of Things (IoT) system. There are tens of trillions of cells in the body, and each and every cell is equipped with a smart receptor (sensor) that reliably sends and receives intelligent signals from the brain. There are two types of neural signals: (1) sensory signals (neurons), such as sight, sound, feeling, taste, and touch, which are sent to a central unit in the brain for translation and response; and (2) motor signals, which are sent back to the receptor for action. The brain registers the transaction as a knowledge record. The immune system goes on alert to eradicate the attacking pathogen. The memory in the brain provides the history of the attack, which gives the body the upper hand when it comes to winning the next battle. Here is our definition of a smart city: a city that resides on top of a very intelligent grid, like the human nervous system. A smart city needs a central coordination center to capture, analyze, and respond to all the messages in the city. A smart city needs a very conductive real-time grid to transport the intelligent messages safely, without collision or conflict. And finally, a smart city needs sensors connected to all its devices to collect and receive the intelligent messages. This is the basic layout of the city. No smart city is complete without a predictive defense mechanism that predicts and detects all incoming surprise attacks. City enemies can quickly design a distributed attack that tarnishes the glamor of the city. Just think of the human body without an immune system. Adaptive digital immunity is the holy grail of the smart city. Hacking guerillas have teams of savvy cyberspies who could fire an army of Trojans to profile all the defense technologies in the city, return to their ateliers, and craft a nasty invasion using the same city technologies.
The smart city is a major disruptive phenomenon characterized by two driving waves: the first one is the unstable human chemistry that creates massive social turbulence, which includes leadership clashes,
FIGURE 13.6 This is a timeline representing the major activities for a generic smart city. The success of building a smart city is monumentally challenging because there are hidden factors that cannot be quantitatively estimated at the start. The circles represent the critical activities of cybercrime. The dashed lines represent stages that are in trouble and are slowing the progress while the money meter is still running.
polarized politics, territorial wrangles, vision myopia, and financial drought. The second driving wave is technological hysteria, which has been raging for the last 20 years, promoted by technology providers, engineers, and academic research. All this fancy talk leads to one conclusion: while everyone is excited about the smart city, there is an underlying concern about how and where to start. In a near-zero-visibility situation, the best alternative is to build a model of a smart city, using simulation and predictive analytical tools to evaluate the necessary integration process. We formulated a polynomial that represents all the influencing variables that make up the smart city solution:

V_scity = V_mgmt + V_plan + V_egov + V_budget + V_safety + V_performance + V_security + V_energy + V_health + V_demographics + V_education + V_infras + V_traffic + V_waste + V_IoT + V_water

As we can see, building a smart city is a complex endeavor and is as risky as going to Mars. We want to show you an interesting scenario about smart city longevity, the factors that influence the progress of the ecosystem, and the components that must be put together to build a scalable framework for the smart city. As shown in Fig. 13.6, the project stretches over 20 years and has three tiers:

• Planning and Engineering Tier: The broken line shows the difficulties that will impact the delivery date of the project.
• Citizens and Implementation Tier: We have six challenging areas that need to be reviewed or reexamined. This indicates that the planning tier was not thoroughly examined.
• Performance and Security Tier: This is the system launch area, where the performance of the system is checked and the security system is deployed (the digital immunity ecosystem and the transfer of operating data to DDS).
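The V_scity polynomial can be read as a composite score over the influencing variables. The sketch below is purely illustrative: the variable names follow the polynomial, but the equal weights and the idea of scoring each variable on a 0-to-1 readiness scale are our own assumptions, not part of the model above. A real planning exercise would calibrate the weights against city priorities:

```python
# Illustrative sketch of V_scity as a weighted composite score.
# Weights of 1.0 reduce to the plain sum in the polynomial above;
# both the weights and the 0-1 readiness scale are hypothetical.

WEIGHTS = {
    "mgmt": 1.0, "plan": 1.0, "egov": 1.0, "budget": 1.0,
    "safety": 1.0, "performance": 1.0, "security": 1.0, "energy": 1.0,
    "health": 1.0, "demographics": 1.0, "education": 1.0, "infras": 1.0,
    "traffic": 1.0, "waste": 1.0, "iot": 1.0, "water": 1.0,
}

def smart_city_score(readiness: dict) -> float:
    """Weighted sum of readiness values (each in [0, 1]); missing
    variables count as 0, i.e., no progress on that front."""
    return sum(WEIGHTS[k] * readiness.get(k, 0.0) for k in WEIGHTS)

# A city that has only half-finished its security program and fully
# funded its budget scores far below the perfect 16.0.
print(smart_city_score({"security": 0.5, "budget": 1.0}))  # 1.5
```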
There is a fancy myth that smart cities are the ultimate Shangri-La. It is a sweet dream, like riding a time machine 2000 years back to meet Jesus Christ! But it won't happen for 200 years. One of the five-star reasons is crime and its sibling, cybercrime. Cybercrime is an extension of human creativity. One of the biggest challenges, even headaches, facing smart cities is cybercrime. One of the marvels of the human immune system is that no backdoor virus or Trojan could sneak into the body, from anywhere, without its knowledge. This is because the white blood cells are incredibly cognitive and smart machines that rush to the infection site and eliminate the foreign invaders. These cells are the defending army, running in the bloodstream like a nebulous grid watching over every cell in the body. This is impressive, isn't it? We need to build a cognitive early-warning predictive system to protect the city and its critical infrastructures. Gartner Hype Cycles provide a graphic representation of the maturity and adoption of technologies and applications and how they are potentially relevant to solving real business problems and exploiting new opportunities. The Gartner Hype Cycle methodology gives you a view of how a technology or application will evolve over time, providing a sound source of insight for managing its deployment within the context of your specific business goals. The Digital Immunity Ecosystem (DIE), also known by its alias, the Cognitive Early Warning Predictive System (CEWPS), does not necessarily follow Gartner's graph, as shown in Fig. 13.7. It behaves in a completely different way, moving steadily upward and leveling off when it reaches maturity at phase CEWPS-5.
FIGURE 13.7 The Digital Immunity Ecosystem (DIE)/DNA graph is plotted over a 10-year stretch. The graph shows the new advances incorporated in DIE/DNA. The Hype Cycle has a global reputation for prediction. The DIE/DNA trajectory, however, does not follow the Hype Cycle, as shown in the diagram. We believe DIE/DNA will have several new generations and eventually will be used globally.
FIGURE 13.8 We all know that complex systems have complex behavior. Complex systems develop goals of their own the instant they come into being. This is a true testimony of how smart cities, the brand-new ones, behave. The diagram represents the creation of a brand-new "smart" city. We have created a timeline for the first 10 years with four milestones. The triple line represents the behavior of the budget, while the dotted line displays the encountered obstacles. Our formulated general rule is that no city is smart or will be perfectly smart.
We have created a 10-year timeline for a project to demonstrate the temporal interaction between scheduled and actual data. For simplicity, we split the project into four stages. The triple line represents the behavior of the budget, while the dotted line displays the encountered challenges. Our formulated general rule is that no city is smart or will be perfectly smart. Fig. 13.8 represents the budget and level of effort in the creation of a brand-new "smart" city. Fig. 13.9 shows how law enforcement will be able to get a good handle on crime in the smart city. All the operational data (the central repositories) generated by the city's critical systems are filtered and streamed before the files are converted to DNA code and stored in a DNA library, which is available at the Law Enforcement Support Center for crime tracking and analysis. The repositories on the left are the DIE knowledge engines keeping all the deep learning files on cybercrime. The repositories on the right are the international cybercrime cases.
What is cyberterrorism?

Cyberterrorism is a formidable weapon that the Internet has plagued the world with. In fact, it is a very sophisticated profession. The elite club is full of cyber experts who know how to build a self-propelled
FIGURE 13.9 Crime data have three categories: the historical records of city citizens; general crime attributes; and correlations between citizens and past crimes. Law enforcement in the smart city will also use machine learning algorithms and deep neural network analyses to solve unresolved crimes and store their outcomes.
system that could fix itself with a paramedic component. Cyberterrorism is intentional and aimed at a specific target, not random, and according to the Center for Strategic and International Studies, only 8% of cybercrimes are discovered (but not necessarily eradicated) before they occur. Take, for example, the Stuxnet worm, which took 2 years to program and was built intentionally for a specific target. According to the US FBI, cyberterrorism is any "premeditated, politically motivated attack against information, computer systems, computer programs, and data which results in violence against noncombatant targets by sub-national groups or clandestine agents." Fig. 13.10 shows the architecture of malware data in binary, eventually migrating into a DNA-coded library. Malware is equivalent to devilware, and its purpose is to unconditionally atomize everything in the city, including turning citizens into criminal cyborgs. The neocortex of the cyberterrorist is full of bright evil projects to convert a happy smart city into a wasteland full of agonized dreamers. Cyberterrorists know the value of DNA and know how to hack healthy genes and turn them into genetic bombs that will turn citizens against one another. To add insult to injury, cyberterrorists will rob DNA data libraries and sell top secrets to adversaries.
FIGURE 13.10 To help smart city law enforcement, collecting crime data is a complex operation that takes time and specialized professionals. The collected data have to be classified into crime records and correlation records. Deep artificial intelligence neural networks and machine learning will be used extensively to collect significant data. The contribution of DNA to storage and forensics is enormous.
But the most perplexing phenomenon in the age of the Internet is the following syndrome: how come malware thieves run faster than cyber defense wardens and outsmart them in every round? Well, it's easier to destroy than it is to build. It took 10,000 workers 5 years to build the World Trade Center, and 19 terrorists brought the two towers down in 2 hours. Cyberterrorists are loyal to themselves and are well connected with organizations on the Dark Web. Now cyberterrorism is using artificial intelligence (AI) and nanotechnology to build weaponry with molecular nanobots and nano cyborgs. One of the most perilous areas in the malware continuum is implantable medical device devilware, which will fall under the IoT umbrella. Hacking human bodies is the most barbaric of all attacks. We have pacemakers, implantable defibrillators, insulin pumps, cochlear implants, infusion pumps, and neurostimulators. Most of these life-saving medical devices do not have security mechanisms in place.
FIGURE 13.11 The third generation of cyberterrorism will hit the world after 2025.
Fifteen demonic applications are riding on the bleeding edge of technology with no defense to protect city citizens or hospital patients. Fig. 13.11 shows the full spectrum of the evil continuum. Malware is actually engineered by highly skilled security gurus who know how to seal and camouflage their AI nano missiles like the Trojan horse.
DNA is the holy grail of smart city

Now, DNA computing (DNAC) comes to the aid of the smart city, not only in biomedicine and genealogy but also in storing information with an astronomical capacity. Another "authoritative" advantage of DNA is that it can provide supportive evidence to law enforcement agencies in tracking criminals. DNA is the ax that gave the coup de grâce to crime. DNA evidence helped put 80% of the criminals now in jail there for life, and it has helped 5% of convicts regain their freedom and leave prison. These are impressive statistics and a victory for justice.
DNA will offer five pivotal benefits in supporting the quality of life in the smart city:

• The National Medical DNA Database collects an individual's DNA, which can reflect his or her medical records and lifestyle details. By recording DNA profiles, scientists may uncover the interactions between the genetic environment and the occurrence of certain diseases (such as cardiovascular disease or cancer) and thus find new drugs or effective treatments for controlling these diseases. It often collaborates with the National Health Service.
• The Centralized DNA Fingerprinting Database stores DNA profiles of individuals, enabling the searching and comparing of DNA samples collected from a crime scene against stored profiles. The most important function of the forensic database is to produce matches between a suspected individual and crime scene biomarkers, providing evidence to support criminal investigations and leading to potential suspects.
• The Smart City Citizens Database stores complete demographic data, employment history, insurance, family, religion, ethnic background, political background, and immigration records.
• The Smart City Citizen Violence and Crime Database stores records of violations, family legal and court cases, and any cases of violence at work or at home. It stores law enforcement arrests due to family abuse, addiction to drugs and alcohol, and crimes against children. Complete prison terms and judgments are stored as well, along with DNA and natural fingerprints, blood samples, and psychiatric and medical treatments.
• The Smart City DNA Operations Storage and Archive stores all the operational transactions of the city.

DNA has proven to be the storage of the future. DNA has the ability to store enormous amounts of binary (computer) data; for example, we can store the contents of 42 million USB sticks in 1 gram of DNA. In the next decade, DNA will be the primary device for storing binary data.
The central dogma is the term that refers to DNA making proteins as illustrated in Fig. 13.12. DNA stores the genetic information in sequence, like a bookshelf or a hard drive, until it is time to start making protein. To do that, DNA goes through two physiological transformational steps: transcription and translation. In DNA, each protein is encoded by a gene, and the sequence of DNA’s building blocks specifies the order and types of amino acids that must be put together to make a protein.
FIGURE 13.12 The central dogma is a term coined by Dr. Francis Crick; it describes the flow of genetic information from DNA to RNA to protein.
In the transcription process (first phase), the master blueprint is DNA, which contains all the information to build the new protein (the house). The working copy of the master blueprint is called messenger RNA (mRNA), which is copied from DNA. The working copy of the blueprint (mRNA) must now go to the construction site, where the workers will build the new protein. In the translation process (second phase), construction of the house continues. Once the working copy of the blueprint has reached the site, the workers start assembling the materials according to the blueprint's instructions: the cellular machinery reads the mRNA three bases at a time and, using the genetic code table, converts each codon into an amino acid, the material needed to make proteins. In the protein creation process (third phase), the assembly line joins the amino acids into a chain that folds into the finished protein.
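The two phases above can be sketched in code. The codon table below is only a small excerpt of the standard genetic code, chosen for illustration; a real implementation would carry all 64 codons:

```python
# Sketch of the central dogma: DNA -> mRNA (transcription), then
# mRNA codons -> amino acids (translation). The codon table is a
# small excerpt of the standard genetic code, for illustration only.

CODON_TABLE = {  # mRNA codon -> amino acid (excerpt)
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "GAU": "Asp", "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def transcribe(dna: str) -> str:
    """Copy the DNA coding strand into mRNA (T is replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> list:
    """Read codons three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGCTAA")   # a tiny hypothetical gene
print(mrna)                         # AUGUUUGGCUAA
print(translate(mrna))              # ['Met', 'Phe', 'Gly']
```

The example gene is hypothetical: AUG starts the chain with methionine, UUU and GGC add phenylalanine and glycine, and UAA stops translation, mirroring the transcription and translation phases described above.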
Appendices

Appendix 13.A The human cell atlas

Excerpts are taken from the Human Cell Atlas Consortium white paper of October 18, 2017: https://www.humancellatlas.org/files/HCA_WhitePaper_18Oct2017. The HCA will be made up of comprehensive reference maps of all human cells, the fundamental units of life, as a basis for understanding fundamental human biological processes and for diagnosing, monitoring, and treating disease. It will help scientists understand how genetic variants impact disease risk, define drug toxicities, discover better therapies, and advance regenerative medicine. Smart cities can join the HCA Consortium to gain access on behalf of their citizens. The input atlas must characterize healthy samples that capture as much genetic, geographic, environmental, and age diversity as possible. Public engagement throughout the course of the research will be essential to achieving the goals of the HCA. The research will thrive with public support and involvement, and patients and the public should be engaged in all aspects of the HCA in a sustained manner. Thus, the HCA community must foster an ongoing dialogue between researchers, funders, patients, and the public. Because the HCA will be built over several years, it will require a public engagement strategy that evolves with the project, taking stakeholders and future beneficiaries on a journey of discovery and debate as the research progresses. For any project of the magnitude and ambition of the HCA, the general public must be considered a target stakeholder community. An important aspect is making the fundamental principles and motivations of the project as accessible as possible, both via major media outlets and through social media. The HCA will enable scientists to visualize and characterize individual cell types within different organs and tissues of the body. A portal to HCA data for the general public that uses a "Google Maps" approach to free exploration would help forge this connection.
The HCA will provide smart cities citizens with information about all areas of human biology, from the taxonomy of cells and histological tissue structure to developmental biology and cell fate and lineage to physiology and homeostasis and their underlying molecular mechanisms. With corresponding atlases of model organisms that facilitate functional assessment, the HCA will allow us to better understand how faithful our models are to human physiology and pathology and to validate findings through perturbation.
The HCA will be able to connect with other crime data on criminals and provide the law enforcement agencies in the smart city with a genetic profile of criminals. Some of the outstanding characteristics of the HCA database:

• Transparency and open data sharing: Data will be released as soon as possible after it has been collected so it can be used immediately.
• Quality: The HCA community will be committed to producing the highest-quality data and establishing rigorous standards, shared openly and broadly and updated regularly.
• Flexibility: The HCA community will maintain intellectual and technical flexibility, so it can revise the design of the HCA as new insights, data, and technologies emerge.
• Community: The HCA community will remain global, open, and collaborative, led by a scientific steering group. It will remain open to all interested participants who are committed to its values.
• Diversity, inclusion, and equity: The selection of tissue samples will reflect geographic, gender, age, and ethnic diversity. Similar diversity will be reflected in the distribution of participating researchers, institutions, and countries.
• Privacy: We are committed to ensuring the privacy of research subjects, consistent with the consent of research participants.
• Technology development: The HCA community will develop, adopt, and share new tools to empower others.
• Computational excellence: The HCA community will develop new computational methods, leveraging and driving the latest algorithmic advances, and share these through scaled, open-source software.
The HCA will help answer fundamental questions in all aspects of biology as well as serve as a guide to unraveling the secrets of human disease. The general areas of medical impact include the following:

• Genes to drugs: The cell atlas will enable researchers to identify the cell types in which a given genetic variant acts, thus helping to pursue therapeutic targets identified by genetic studies of disease. For example, analyzing tens of thousands of neurons in the retina revealed new subtypes that eluded neuroscientists before, which can help us find in which cells the genes important in blindness actually act.
• Regenerative medicine: An atlas of cell types that are lost in disease will enable efforts to generate such cells faithfully. Similarly, an atlas of healthy human tissues and the matching organoids or in vitro differentiated cells will help determine if the engineered samples faithfully represent normal tissue composition and identify ways to complete any missing components. For example, efforts are underway to produce dopamine neurons in vitro or, alternatively, to reprogram cells in vivo into dopamine-producing neurons to treat Parkinson's disease; an atlas of cell types will pinpoint characteristics that must be programmed into these cells for them to succeed.
• Disease mechanisms: Because the cell atlas will provide detailed maps of cells and their roles in tissues, researchers will be able to understand the mechanisms underlying any disorder at both the cell and the cellular ecosystem level. For example, an atlas of the small intestine will help map the cell of action for genes associated with Crohn's disease, food allergy, obesity, and colon cancer.
• Drug discovery: The cell atlas will provide guidance as to which gene signatures to pursue in drug screens to represent desired cell phenotypes. For example, it can give us a molecular map of which genes and signatures drive cell development and how it goes awry in, say, cancer and provide targets for drug discovery.
• Toxicity: It will be possible to determine where else in the body a gene is expressed, helping to identify potential off-target effects before drug trials. For example, a cell atlas will help CAR-T immunotherapy cell developers ensure that the cells do not inadvertently target healthy essential cells that express the same gene or that drugs will not have off-target effects in other tissues (for example, causing blindness by targeting genes expressed in the retina).
• Drug efficacy and resistance: The atlas will provide the tools necessary to understand why drugs work, or don't, at the level of cells and tissues, both before and after treatment. For example, a "cellular ecosystem map" that identifies both target cell types and target molecules of immunotherapy will help predict and monitor tumor response and provide new leads for immune modulation in resistant patients.
• Diagnostics: Knowledge of all the cell types in the body and their role in disease will enable updated and much more powerful versions of common diagnostic tools, such as the complete blood count (CBC) and next-generation biopsy. For example, the CBC, a census of a limited number of blood components that is used in a variety of diagnostic settings, could be supplemented by a "CBC 2.0" that would provide a high-resolution picture of the nucleated cells in, for example, blood disorders, infectious disease, autoimmune disease, and cancer. Tissue biopsies from patients could also be analyzed with unprecedented resolution.
Appendix 13.B DNA computing explained

DNAC (molecular computing) is a new technology that uses biological molecules, rather than traditional silicon chips, for computation. The idea that individual molecules (or even atoms) could be used for computation dates to 1959, when American physicist Richard Feynman presented his ideas on nanotechnology. However, DNAC was not physically realized until 1994, when American computer scientist Leonard Adleman, at the University of Southern California, showed how molecules could be used to solve a computational problem. He demonstrated the first DNA computer by solving a simple example of what is known as the traveling salesman problem (https://www.britannica.com/technology/DNA-computing).

A computation may be thought of as the execution of an algorithm, which itself may be defined as a step-by-step list of well-defined instructions that takes some input, processes it, and produces a result. In DNAC, information is represented using the four-character genetic alphabet (A [adenine], G [guanine], C [cytosine], and T [thymine]), rather than the binary alphabet (1 and 0) used by traditional computers. This is achievable because short DNA molecules of any arbitrary sequence may be synthesized to order. An algorithm's input is therefore represented (in the simplest case) by DNA molecules with specific sequences, the instructions are carried out by laboratory operations on the molecules (such as sorting them according to length or chopping strands containing a certain subsequence), and the result is defined as some property of the final set of molecules (such as the presence or absence of a specific sequence).
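Since each of the four bases can stand for two binary digits, the quaternary-for-binary substitution described above can be sketched in a few lines of code. The following is a minimal illustrative sketch, not a standard: the particular mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are our own choices, and real DNA storage codecs add constraints (avoiding long homopolymer runs, balancing GC content) that are omitted here.

```python
# Hypothetical 2-bits-per-base codec. The 00->A, 01->C, 10->G, 11->T
# mapping is an illustrative choice; production codecs are more complex.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a DNA-alphabet string (4 bases per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert encode(): read the bases back as pairs of bits."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"Hi")
print(strand)                 # "Hi" is 01001000 01101001 -> CAGACGGC
assert decode(strand) == b"Hi"
```

Note that the information density is fixed at two bits per base under this scheme; the round-trip assertion checks that no information is lost.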
The experiment

Adleman's experiment involved finding a route through a network of "towns" (labeled "1" to "7") connected by one-way "roads." The problem specifies that the route must start and end at specific towns and visit each town only once. (This is known to mathematicians as the Hamiltonian path
problem, a cousin of the better-known traveling salesman problem.) Adleman took advantage of the Watson-Crick complementarity property of DNA: A and T stick together in pairwise fashion, as do G and C (so the sequence AGCT would stick perfectly to TCGA). He designed short strands of DNA to represent towns and roads such that the road strands stuck the town strands together, forming sequences of towns that represented routes (such as the actual solution, which happened to be "1234567"). Most such sequences represented incorrect answers to the problem ("12324" visits a town more than once, and "1234" fails to visit every town), but Adleman used enough DNA to be reasonably sure that the correct answer would be represented in his initial pot of strands.

The problem was then to extract this unique solution. He achieved this by first greatly amplifying (using a method known as the polymerase chain reaction) only those sequences that started and ended at the right towns. He then sorted the set of strands by length (using a technique called gel electrophoresis) to ensure that he retained only strands of the correct length. Finally, he repeatedly used a molecular "fishing rod" (affinity purification) to ensure that each town in turn was represented in the candidate sequences. The strands Adleman was left with were then sequenced to reveal the solution to the problem.

Adleman wanted to explore the possibility of computing with molecules. Right after his experiment was published, an ardent competition started between DNA-based computers and their silicon-based counterparts. It is evident that molecular computers will one day solve problems that would cause existing machines to struggle, thanks to the inherent massive parallelism of biology: a small drop of water can contain trillions of DNA strands, and biological operations act on all of them, effectively, in parallel (as opposed to one instruction at a time).
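Adleman's generate-then-filter strategy can be mimicked in software. The sketch below is purely illustrative: the directed graph is a made-up seven-town network (not Adleman's actual one), and the three set comprehensions play the roles of his three laboratory steps, as noted in the comments.

```python
# Software analogue of Adleman's filtering strategy on a toy directed graph.
# The edge set below is hypothetical, not Adleman's actual 7-town network.
import random

EDGES = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (1, 3), (3, 5), (2, 4)}
TOWNS = set(range(1, 8))

def random_walk(rng, max_len):
    """One random 'strand': a walk from town 1, stopping at dead ends."""
    path = [1]
    for _ in range(max_len - 1):
        choices = [b for (a, b) in EDGES if a == path[-1]]
        if not choices:
            break
        path.append(rng.choice(choices))
    return tuple(path)

rng = random.Random(0)
pot = {random_walk(rng, 7) for _ in range(100_000)}        # the initial "pot"
pot = {p for p in pot if p[0] == 1 and p[-1] == 7}         # PCR: right endpoints
pot = {p for p in pot if len(p) == 7}                      # gel: right length
pot = {p for p in pot if set(p) == TOWNS}                  # affinity: every town
print(pot)  # surviving strands are Hamiltonian paths from town 1 to town 7
```

The point of the analogy is that each filter discards whole classes of wrong answers at once; in the wet-lab version those discards happen over trillions of strands simultaneously, which is where the massive parallelism comes from.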
Technology trajectory shows that DNA computers will soon have several concrete advantages over our present black boxes, namely in storage, time, and parallel execution. DNA computers still face many challenges and potholes on the way to a favorable cost/benefit ratio. In the Hamiltonian path problem, for example, the quantity of DNA required grows exponentially with the number of vertices. Adleman's experiment opened up new frontiers in computing. Let us not forget that quantum computing is another tough competitor; it will coexist with and complement DNA computers.
Biochemistry-based information technology

This is amazing! Computer scientists at Caltech announced on March 21, 2019, in Nature that they had designed, for the first time, a DNA molecular computer whose self-assembly can run many different algorithms on the same molecular hardware, much as our conventional computers do. Let me explain this remarkable invention: the DNA here acts as a chemical computer running in a test tube at the nanoscale. The system assembles logic circuits from DNA strands stored in a strand library and then builds the algorithm and executes it on those circuits. The system is analogous to a computer, but instead of using transistors and diodes, it uses molecules to represent a six-bit binary number. Starting with the original six bits that represent the input, the DNA system adds row after row of molecules, progressively running the built algorithm. Modern digital electronic computers use electricity flowing through circuits to manipulate information; here, the rows of DNA strands sticking together perform the computation. Dr. Damien Woods, professor of computer science at Maynooth University near Dublin, Ireland, eloquently stated, "The ability to run any type of software program without having to change the hardware is what allowed computers to become so useful." The final result is a test tube
filled with billions of completed algorithms, each one resembling a knitted scarf of DNA, representing the output of the computation. The arrangement of scarves gives you the solution to the algorithm that you were running. The system can be set up to run a different algorithm by simply selecting a different configuration of strands from the strand library. The library has over 800 strands ready to be used for computation.
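The row-after-row idea can be caricatured in ordinary code. In this hedged sketch, the "strand library" becomes a dictionary of functions, and "selecting strands" becomes picking one function that maps each six-bit row to the next; the two example rules (sorting the bits, broadcasting the row's parity) are our own inventions, not rules from the Caltech system.

```python
# Caricature of the Caltech row-by-row DNA computer: each "program" is a
# function from one 6-bit row to the next. Both rules here are hypothetical.
from typing import Callable, List

Row = List[int]  # a row of six bits

def rule_sort(row: Row) -> Row:
    """Sort the bits within the row (zeros drift left)."""
    return sorted(row)

def rule_parity(row: Row) -> Row:
    """Replace the row with six copies of its parity bit."""
    return [sum(row) % 2] * 6

LIBRARY = {"sort": rule_sort, "parity": rule_parity}  # the "strand library"

def run(program: str, start: Row, rows: int) -> List[Row]:
    """Grow `rows` new rows from the input row, like the DNA assembly."""
    step: Callable[[Row], Row] = LIBRARY[program]
    history = [start]
    for _ in range(rows):
        history.append(step(history[-1]))
    return history

print(run("sort", [1, 0, 1, 1, 0, 0], 2))
```

Swapping `"sort"` for `"parity"` changes the algorithm without touching `run` at all, which is the software analogue of Woods's point about running different programs on unchanged hardware.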
Real case of a chemical DNA computer

Scientists and engineers from BAE Systems (a global defense, aerospace, and security company) and the University of Glasgow envisage that small unmanned air vehicles bespoke to military operations could be "grown" in large-scale labs. Military technology is galloping by leaps and bounds to push the sophistication of weaponry to the highest level. By 2100, military technology will move away from conventional arms and adopt an asymmetrical strategy of building, with minimal human assistance, complex systems that defy imagination. The DNA future of warfare is chemical and biochemical in a very unconventional way. As we get closer to 2100, scientists believe it will be possible to "grow" drones and military aircraft from chemical compounds, using DNA-centric chemputers, as shown in Fig. 13.13. Additional information on the chemputer can be found at https://www.3ders.org/articles/20160704-bae-systems-reveals-plans-for-chemputer-3d-printer-that-chemically-grows-military-drones.html.
Appendix 13.C Glossary for smart city (MERIT cybersecurity engineering)

Adeno-associated virus (AAV): A small, stable virus that is not known to cause disease in humans. The naturally occurring form of the virus has only two genes, which are removed in the construction of AAV vectors for gene delivery. Neither AAV nor AAV vectors have been known to induce an immune response.
FIGURE 13.13 BAE Systems reveals plans for chemputer 3D printer that chemically grows military drones. This technology is extremely disruptive and will revolutionize attack weaponry.
Adenovirus: A virus that causes clinical conditions such as the common cold and respiratory infections.
Advanced metering infrastructure (AMI): Another term for smart meters: electricity meters that automatically measure and record usage data at regular intervals and provide the data to consumers and energy companies at least once daily. The opposite is automated meter reading: meters that collect data for billing purposes only and transmit these data one way, usually from the customer to the distribution utility.
Agarose gel electrophoresis: A technique used to separate DNA fragments (and proteins) by their size. An electric current is used to propel the DNA (or proteins) through a porous gel matrix.
Allele: A particular sequence variation of a gene or a segment of a chromosome.
Alternative splicing: The processing of an RNA transcript into different mRNA molecules by including some exons and excluding others.
Amino acid: The basic building block of a protein. There are 20 different amino acids commonly found in proteins. The genetic code specifies the sequence of amino acids in a protein.
Amplification: The repeated copying of a DNA sequence.
Annealing: The hydrogen bonding between complementary DNA (or RNA) strands to form a double helix.
Annotation: The process of locating genes, their coding regions, and their functions.
Anonymized data: Data that cannot be traced back to their donor.
Anticodon: A 3-base sequence in a tRNA molecule that base-pairs with its complementary codon in an mRNA molecule.
Application programming interface: A set of definitions, protocols, and tools for building application software.
Apps: Apps (application programs) are computer programs designed to perform a group of integrated activities for the benefit of the user.
Artificial intelligence: Intelligence exhibited by machines, rather than humans or other animals.
Assembly: Putting sequenced fragments of DNA into their correct order along the chromosome.
Automated meter reading (AMR): A technology used in utility meters for collecting the data that are needed for billing purposes. AMR, which works by translating the movement of the mechanical dials on a meter into a digital signal, does not require physical access or visual inspection. The data can be transmitted from the meter to the utility company by telephone, power line, satellite, cable, or radio frequency.
Autonomous vehicles: Vehicles capable of sensing their environment and navigating without human input.
Autosome: Any of the numbered chromosomes that are not involved with sex determination.
Bacterial artificial chromosome: A chromosome-like structure constructed using recombinant DNA technology. It is used to clone large DNA inserts (100-300 kb) into Escherichia coli cells.
Base: One of the five chemicals (also called nitrogenous bases) found in nucleic acids (DNA and RNA): adenine, thymine, guanine, cytosine, and uracil.
Base pair: Two nitrogenous bases (adenine and thymine/uracil or guanine and cytosine) held together by weak hydrogen bonds.
Base pair substitution: A type of mutation where one base pair is replaced with a different one; also called a point mutation.
Big data: The collection of data sets that are so large and complex that it is difficult to capture, transfer, store, process, and interpret them with traditional data processing applications. It allows rich information to be derived on a range of variables such as real-time traffic conditions, air pollution, and energy use.
Bioethics: The study of ethical issues raised by developments in the life science technologies.
Bioinformatics: The study of collecting, sorting, and analyzing DNA and protein sequence information using computers and statistical techniques.
BLAST (Basic Local Alignment Search Tool): A computer program that searches for sequence similarities. It can be used to identify homologous genes in different organisms.
Candidate gene: A gene that is suspected of being associated with a particular disease.
Carrier: A person who is heterozygous for a mutation associated with a genetic disease. Usually, a carrier does not display symptoms of the disease but may pass the mutation on to offspring.
cDNA (complementary DNA): A DNA molecule synthesized from an mRNA molecule. It can be used experimentally to determine the sequence of an mRNA.
Cellular: A cellular network or mobile network is a communication network where the final link is wireless.
Centromere: The compact region near the center of a chromosome.
Chromosome: A rod-like structure found in the cell nucleus. It consists of one long DNA molecule with its associated proteins.
City-as-a-service: Combines Infrastructure-as-a-Service (IaaS) and Software-as-a-Service (SaaS) technologies for use as a common, city-wide platform for the deployment of integrated smart city technologies. Think: operating system for the city.
Clone: (1) A genetically identical copy of an individual cell or organism; (2) an exact copy of a DNA sequence.
Cloning vector: A DNA molecule, such as a modified plasmid or virus, which can be used to clone other DNA molecules in a suitable host cell. Cloning vectors must be able to replicate in the host cell and must possess restriction enzyme cut sites that allow the DNA molecules targeted for cloning to be inserted and retrieved.
Cloud computing: An information technology model for enabling ubiquitous access to shared pools of data and computing resources, typically over the Internet.
Coding DNA or region: A sequence of DNA that is translated into protein; also called exons (in eukaryotes).
Codon: A three-base sequence in a DNA or mRNA molecule that specifies a specific amino acid or termination signal; the basic unit of the genetic code.
Combined DNA Index System (CODIS): A database maintained by the FBI. It includes DNA profiles of convicted offenders in the United States.
Comparative genomics: The process of learning about human genetics by comparing human DNA sequences with those from other organisms.
Connected devices: A connected device (or smart device) is an electronic device, generally connected to other devices or networks, that can operate to some extent interactively and autonomously.
Connectivity: The ability of individuals and devices to connect to communications networks or the Internet and to access services such as email and the World Wide Web.
Consanguineous: Marriage or mating among related individuals.
Conserved sequence: A DNA (or amino acid) sequence that has remained relatively unchanged throughout evolution. Such a sequence is under selective pressure and therefore resistant to change.
Contig: A contiguous sequence of DNA created by assembling shorter, overlapping sequenced fragments of a chromosome (whether natural or artificial, as in BACs). A list or diagram showing an ordered arrangement of cloned overlapping fragments that collectively contain the sequence of an originally continuous DNA.
Cosmid: A cloning vector derived from a bacterial virus. It can accommodate about 40 kb of inserted DNA.
Cycle sequencing: A DNA sequencing technique that combines the chain termination method developed by Fred Sanger with aspects of the polymerase chain reaction.
Data set: A collection of data.
Deletion: A type of mutation caused by the loss of one or more adjacent base pairs from a gene.
Deoxyribose: The five-carbon sugar component of DNA. It has one less hydroxyl group than ribose, the sugar component of RNA.
Dideoxynucleotides (ddNTPs): Synthetic nucleotides lacking both 2' and 3' hydroxyl groups. They act as chain terminators during DNA sequencing reactions.
Directed sequencing: Successively sequencing DNA from adjacent stretches of chromosome.
DNA (deoxyribonucleic acid): The hereditary material that exists as a double-stranded helical molecule made up of a deoxyribose sugar phosphate backbone and the four nitrogenous bases, named adenine (A), cytosine (C), guanine (G), and thymine (T).
DNA chip: A microarray of oligonucleotides or cDNA clones fixed on a surface. They are commonly used to test for sequence variation in a known gene or to profile gene expression in an mRNA preparation.
DNA ligase: An enzyme able to form a phosphodiester bond between adjacent but unlinked nucleotides in a double helix.
DNA polymerase: An enzyme that adds bases to a replicating DNA strand.
DNA probe: A chemically synthesized, often radioactively labeled, segment of DNA used to visualize a genomic sequence of interest by hydrogen bonding to its complementary sequence.
DNA replication: The process of replicating a double-stranded DNA molecule.
Dynamic pricing: Also referred to as demand-based pricing, a pricing strategy in which flexible prices for products or services are based on current market demands.
Electric vehicles: Vehicles that rely completely on one or more electric motors for propulsion.
Gene: A region of DNA that can encode one or more polypeptides or RNA products.
Gene expression: The process by which a gene is transcribed into RNA and then translated into a protein.
Genetic code: The mapping between the set of 64 possible three-base codons and the amino acids or stop codons specified by each of the triplets.
Genome: The complete DNA content of an organism.
Genomics: The comprehensive study of whole sets of genes and their interactions rather than single genes or proteins.
Information and communications technology (ICT): The integration of telecommunications, computers, and associated enterprise software, middleware, storage, and audio-visual systems that enables users to access, store, transmit, and manipulate information.
Infrastructure: The fundamental facilities and systems serving a city, country, or other area, including the services and facilities necessary for its economy to function.
Instrumentation: A collective term for measuring instruments used for indicating, measuring, and recording physical quantities.
Integrated services: In computer networking, an architecture that specifies the elements to guarantee quality of service on networks.
Interoperability: A characteristic of a product or system whose interfaces are able to work seamlessly with a defined set of other products or systems.
Kilowatt hour (kWh): The amount of energy you get from 1 kilowatt for 1 hour. Electricity use over time is measured in kilowatt hours; your electric company measures how much electricity you use in kilowatt hours, abbreviated as "kWh." A kilowatt is a unit of power equal to 1000 watts.
Legacy systems: In computing, old and outdated methods, technologies, computer systems, or application programs.
Livability: Quality of life; the general well-being of individuals, communities, and societies.
LoRa: A long-range, low-power wireless platform that is the prevailing technology choice for building Internet of Things networks worldwide.
Low-power wide-area network: A type of wireless telecommunication wide area network designed to allow long-range communications at a low bit rate among connected objects.
Multimodal: Transportation systems that include a wide range of transportation options, including walking, bicycling, bus, light rail, train, ferry, and shared mobility services.
Near-field communication: A set of communication protocols that enable two electronic devices, one of which is usually a portable device such as a smartphone, to establish communication by bringing them within 4 cm of each other.
Off-peak hours: Those hours or other periods defined by NAESB business practices, contracts, agreements, or guides as periods of lower electrical demand.
On-peak hours: Those hours or other periods defined by NAESB business practices, contracts, agreements, or guides as periods of higher electrical demand.
Open data: Data that are freely available to everyone to use and republish as they wish, without copyright, patent, or other restrictions.
Open standards: Publicly available standards developed through a broad consultation process that govern the application of a particular domain or activity.
Optimization: The process of achieving the best possible outcome relative to a defined set of success metrics.
Outage: The period during which a generating unit, transmission line, or other facility is out of service.
Peak load: The maximum load during a specified period of time.
Platform: A computing platform is the environment within which a piece of software is executed.
Platform as a service: A category of cloud computing services that provides a platform allowing customers to develop, run, and manage applications.
Predictive analysis: The use of statistical techniques such as predictive modeling, machine learning, and data mining to analyze data and make predictions about the future.
Predictive analytics: A range of statistical techniques from predictive modeling, machine learning, and data mining that analyze existing data to make predictions about future events.
Privacy: In a smart city context, the ability of an individual or group to control the types, amounts, and recipients of data about themselves.
Real time: Real-time computing describes hardware and software systems that are able to respond very rapidly to continuously occurring external events.
Reference architecture: In the field of software architecture, a template solution valid for a particular domain that can be used again and again.
Responsibilities: City responsibilities are key domains within which city governments and their private sector partners provide important services to residents, visitors, and businesses.
RF mesh: A wireless communications network made up of radio frequency nodes organized in a flexible mesh topology driven by connections between neighboring nodes.
Ride hailing services: Services that arrange one-time rides on very short notice, typically using a dedicated app.
Security: In a computer science context, the ability to maintain the integrity of all data, software, hardware, and devices against unauthorized actors.
Sensor: An electronic component, module, or subsystem whose purpose is to detect events or changes in its environment.
Shared transportation: A term describing a demand-driven vehicle-sharing arrangement in which travelers share a vehicle either on demand or over time.
Siloed cities: Cities with poor integration between different city responsibilities, across departments, among communication networks, and with other regional governments.
Situational awareness: The perception of environmental elements and events, the comprehension of their meaning, and the understanding of their status after some variable has changed.
Smart cities: A smart city harnesses digital technology and intelligent design to create a sustainable city where services are seamless and efficient and provide for a high quality of life for citizens. A smart city uses information and communications technology to enhance livability, workability, and sustainability.
Smart devices: Typically, electronic devices, generally connected to other devices or networks via different protocols such as Bluetooth, NFC, Wi-Fi, 3G, etc., that can operate to some extent interactively and autonomously.
Smart grid: The "grid" refers to our nation's electric power infrastructure. Smart grid is the application of information technology, tools, and techniques like smart meters, sensors, real-time communications, software, and remote-controlled equipment to improve grid reliability and efficiency.
Smart home: The integration of a smart meter along with Wi-Fi-enabled appliances, lighting, and other devices that conveniently change the way a family interacts with its home and optimize home energy consumption.
Smart infrastructure: The integration of smart technologies into the fundamental facilities and systems serving a city, country, or other area, including the services and facilities necessary for its economy to function.
Smart meter: A common form of smart grid technology: the digital meters that replace the old analog meters used in homes to record electrical usage. Digital meters can transmit energy consumption information back to the utility on a much more frequent schedule than analog meters, which require a meter reader to transmit information. A smart meter is an electronic device that records consumption of electricity in intervals of an hour or less and communicates that information back to the utility for monitoring and billing.
Smart networks: Networks that contain built-in diagnostics, management, fault tolerance, and other capabilities to prevent downtime and maintain efficient performance.
Smart parking: A vehicle parking system that helps drivers find a vacant spot using sensors and communications networks.
Smart street lighting: Street lights that can be controlled wirelessly to save energy and reduce maintenance costs. The wireless network controlling street lighting can also be expanded to connect sensors that gather data on weather conditions, air pollution, and more.
Smart transportation: Aims to provide innovative services relating to different modes of transportation and traffic management and to enable various users to be better informed and make safer, more coordinated, and better use of transportation networks.
Smart waste: Waste receptacles, such as city litter bins and commercial waste bins, equipped with connected sensors that collect and share data on, for example, the need for and frequency of waste collections.
Structured data: Data that follow an abstract model that organizes elements of data and standardizes how they relate to one another.
Switches: Electrical components that can "make" or "break" an electrical circuit, interrupting the current or diverting it from one conductor to another.
Unstructured data: Information that does not have a predefined data model.
Urban data platform: A common digital environment for the aggregation of data across multiple city responsibility areas and departments.
Urbanization: The population shift from rural to urban areas, the gradual increase in the proportion of people living in urban areas, and the ways in which each society adapts to the change.
Workability: Economic competitiveness, which can be measured by productivity, innovation, and openness.
Suggested readings

Change in DNA for Smart City. http://www.govtech.com/products/Change-Is-in-the-DNA-of-All-Smart-City-Initiatives.html.
CISCO DNA for Cities. https://www.cisco.com/c/en/us/solutions/enterprise-networks/dna-cities.html.
CISCO Global Cloud Index. https://www.cisco.com/c/en/us/solutions/service-provider/global-cloud-index-gci/white-paper-listing.html.
Cybersecurity Ventures Statistics. https://cybersecurityventures.com/cybersecurity-almanac-2019/.
Eric Topol, March 12, 2019. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again, first ed. Basic Books.
Glossary for Smart City (Courtesy MERIT CyberSecurity Library). www.termanini.com.
Human Cell Atlas. https://www.humancellatlas.org/news/13.
Jean-Michel Claverie and Cedric Notredame, 2003. Bioinformatics. Wiley Publishing.
Linkner, J., January 24, 2017. Hacking Innovation: The New Growth Model from the Sinister World of Hackers. Fastpencil Publishing.
Marine Traffic. https://www.marinetraffic.com/en/ais/home/centerx:-72.7/centery:39.8/zoom:6.
Netter, Frank H., April 7, 2014. Atlas of Human Anatomy, sixth ed. Saunders.
Norbert Wiener, Cybernetics. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.644.2262&rep=rep1&type=pdf.
Ray Kurzweil. https://www.kurzweilai.net/the-singularity-is-near.
Smart City Longevity. https://www.rsm.nl/fileadmin/Images_NEW/ECFEB/pdf/Stadhouder_2017_Determinants_of_Longevity_of_Smart_City_Innovation_Ecosystems_and_projects.pdf.
Storage cost. https://cienciaencanoa.blogspot.com/2015/02/mad-scientists-at-mit-are-designing.html.
The challenge of paying for smart cities projects. https://www2.deloitte.com/content/dam/Deloitte/global/Documents/Public-Sector/gx-ps-the-challenge-of-paying-for-smart-cities-projects1.pdf.
The Future of Data Storage. https://spectrum.ieee.org/computing/hardware/why-the-future-of-data-storage-is-still-magnetic-tape.
The World Internet Statistics. www.internetworldstats.com/.
CHAPTER 14
DNA is a time storage machine for 10,000 years
The only reason for time is so that everything doesn't happen at once. – Albert Einstein.
The World Wide Web went from zero to millions of pages in a few years. Many revolutions look irrelevant before they change everything. – George Church, interview with Popular Science.
It's the fastest thing I've seen yet. It's like you throw a piston into a car and it finds its way to the right place and swaps out with one of the other pistons, while the motor's running. – George Church, talking about CRISPR.
DNA is Man's Health and Mortality Crystal Ball. In Church's eyes, the world is a place where DNA is the ultimate computer code and we are all computer programmers.
A special genre of time machine
Many dramatic claims about DNA have been made by leading genomics scientists, who have called the genome a "Delphic oracle," "a time machine," "a trip into the future," and "a medical crystal ball." It is a "Bible," the Book of Man, and the Holy Grail. The former director of the US Human Genome Project and Nobelist James Watson has proclaimed that DNA is what makes us human and that "our fate is in our genes." Some scientists have even claimed that our genes, or our ability to control them, can lead us to a land free of illness, crime, uncertainty, or psychic distress. One geneticist promises that with the help of gene therapy, "present methods of treating depression will be seen as crude as former pneumonia treatments seem now." And a biologist and science editor, describing acts of violence, suggested that "when we can accurately predict future behavior (with genetic analysis), we may be able to prevent such behavior." Meanwhile, a molecular geneticist of our acquaintance, Dr. Ha, in a casual conversation, happily predicted a future system of genetic assessment of all newborns. Ha said, "parents will one day be given a printout of their genome, along with some frank advice about their talents, deficits, and ideal career choices." Early humans relied on word-of-mouth stories, passed on from generation to generation, to learn what the world looked like in years past. Once invented, writing became the "time machine" of choice; Aristotle's words help us see what people thought more than 2000 years ago, for example. But what if
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00014-8 Copyright © 2020 Elsevier Inc. All rights reserved.
we want to know what life was like before written text, before the beginning of modern oral history? How can we see the world that existed more than 50,000 years ago? For this, we turn to the stories contained within our DNA. At its core, genetics is a historical discipline. Mutations are passed on from generation to generation and accumulate as a result of chance as well as of selection within and between populations and species. Over 30 years ago, the first evidence emerged that DNA could survive in dead organisms. The polymerase chain reaction (PCR) is a method widely used in molecular biology to make many copies of a specific DNA segment: a single template copy is exponentially amplified to generate thousands to millions of copies of that particular segment. With PCR, it became possible to reproducibly retrieve old DNA sequences. This resulted in novel insights about the relationships of extinct animals and in the determination of the first DNA sequences from a Neandertal, the closest extinct relative of present-day humans.
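To make the scale of PCR amplification concrete, here is a small sketch (my illustration, not from the book; the function name and the 90% efficiency figure are assumptions) of how doubling per thermal cycle turns one surviving template molecule into a usable pool:

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of DNA copies after `cycles` rounds of PCR.

    Each thermal cycle multiplies the pool by (1 + efficiency);
    a perfect reaction (efficiency = 1.0) doubles it every cycle.
    """
    return initial_copies * (1 + efficiency) ** cycles

# One surviving ancient-DNA template after 30 perfect cycles:
print(pcr_copies(1, 30))   # 1073741824.0, i.e. 2**30, over a billion copies

# A more realistic reaction at 90% per-cycle efficiency still yields
# hundreds of millions of copies:
print(pcr_copies(1, 30, efficiency=0.9))
```

The exponential growth is exactly what makes old, degraded samples readable: even a single intact fragment becomes enough material to sequence.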
DNA time clock can predict when we will die
A few years ago, scientists discovered that certain chemical changes to our DNA that accumulate over time can be used to predict our age. Now, taking this one step further, researchers have found that the difference between this estimated age and our actual, chronological, age can be used as a kind of biological clock to predict our life span (https://www.iflscience.com/health-and-medicine/recently-identified-dna-clock-helps-predict-mortality/). Unfortunately, even after taking a variety of factors into consideration, the researchers found that if a person's estimated age is higher than his or her chronological age, then he or she is more likely to die sooner than individuals whose ages match up. The DNA modifications that the researchers used to estimate an individual's age are a type of epigenetic change: changes that result in alterations in gene expression, such as switching genes on or off, without actually modifying the DNA sequence itself, as a mutation would. In this case, the researchers were looking at DNA methylation, which involves the addition of a chemical tag at certain sites along the sequence. It is well known that the degree of DNA methylation (considered a marker of aging) changes as we age and that levels can be influenced by lifestyle, environmental, and genetic factors. Cigarette smoking and acute stress, for example, have been shown to alter DNA methylation. Earlier studies found that DNA methylation can be used to estimate an individual's age and that differences between this predicted age and actual chronological age may be associated with risk for age-related diseases and mortality. However, no studies had investigated whether DNA methylation is strongly linked to an individual's life span. As described in a November 2019 article in Genome Biology (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1824-y), researchers found that having a DNA methylation age 5 years higher than your actual, or chronological, age is associated with a 21% higher risk of mortality, or death from all causes, even when age and sex were taken into consideration. However, when the researchers considered a variety of other environmental and lifestyle factors, such as education, smoking, diabetes, cardiovascular disease, and social class, the increased risk of mortality was reduced to 16%.
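The reported figures translate into a simple back-of-the-envelope calculation. The sketch below is my illustration, not the study's method; it compresses the study's Cox regression into a single compounding hazard-ratio exponent, which is a deliberate simplification:

```python
def mortality_hazard_ratio(methylation_age, chronological_age, hr_per_5_years=1.21):
    """Relative all-cause mortality risk implied by methylation-age acceleration.

    Assumes the reported hazard ratio (1.21 per 5 years of acceleration,
    or 1.16 after lifestyle adjustment) compounds multiplicatively.
    """
    acceleration = methylation_age - chronological_age
    return hr_per_5_years ** (acceleration / 5.0)

# A 60-year-old whose DNA methylation age reads 70 (10 years accelerated):
print(round(mortality_hazard_ratio(70, 60), 2))        # 1.46 -> ~46% higher risk
# The same gap under the lifestyle-adjusted ratio of 1.16:
print(round(mortality_hazard_ratio(70, 60, 1.16), 2))  # 1.35
```

A matching methylation and chronological age gives a ratio of exactly 1.0, the baseline risk.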
The link between biological clock and mortality
The study suggests there is a link between the "biological clock" and mortality, but further investigation is needed to determine which specific elements, whether environmental, genetic, or lifestyle, play the biggest role in a person's biological longevity. Researchers have already started studying the link between the biological clock and mortality, and the results will be published in medical journals. Ian Deary, Professor of Differential Psychology at the University of Edinburgh, said in a news release, "This new research increases our understanding of longevity and healthy aging. It is exciting as it has identified a novel indicator of aging, which improves the prediction of lifespan over and above the contribution of factors such as smoking, diabetes and cardiovascular disease." In Greek mythology, the amount of time a person spent on the earth was determined at birth by the length of a thread spun and cut by fate. Modern genetics suggests the Greeks had the right idea: particular DNA threads called telomeres have been linked to life expectancy. But new experiments are unraveling old ideas about fate.
The telomere story
The DNA that makes up our genes is entwined in 46 chromosomes, each of which ends with a telomere, a stretch of DNA that protects the chromosome, like the plastic tip on a shoelace, as shown in Fig. 14.1. Telomeres are quite long at birth and shorten a bit every time a cell divides; ultimately, after scores of divisions, very little telomere remains, and the cell becomes inactive or dies.
FIGURE 14.1 Telomeres are the caps at the end of each strand of DNA that protect our chromosomes, like the plastic tips at the end of shoelaces. Without the coating, shoelaces become frayed until they can no longer do their job, just as without telomeres, DNA strands become damaged and our cells cannot do their job. MERIT CyberSecurity Engineering
And because elderly people generally have shorter telomeres than younger people, scientists believe that telomere length may be a marker for longevity as well as cellular health. This link offers more detail on the subject of telomeres and longevity (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370421/). Biomedical geneticists have revealed that human experiences can affect telomeres, a new revelation that nurture (environmental variables) has an influence on nature (genes and hereditary factors). Dr. Idan Shalev, Assistant Professor of Biobehavioral Health at Pennsylvania State University and a telomere expert, analyzed DNA samples from children of different ages. He concluded, "We found that children who experience multiple forms of violence had the fastest erosion of their telomeres, compared with children who experienced just one type of violence or did not experience violence at all." Telomere tests could be used to estimate not only how fast someone is aging but possibly how long he or she has left to live. Another study, conducted in September 2018 at Brigham and Women's Hospital in Boston, hints at possible physical effects of chronic stress. Among a sample of 5243 nurses nationwide, those who suffered from phobias had significantly shorter telomeres than those who did not. According to Olivia Okereke, the study's lead author, "It was like looking at someone who is 60 years old versus someone who is 66 years old." "The telomeres are essential for protecting chromosome ends," says Carol Greider, a molecular biologist at Johns Hopkins University and a pioneering telomere researcher who was awarded a share of the 2009 Nobel Prize in Physiology or Medicine. "When the telomere gets to be very, very short, there are consequences," she says, noting the increased risk of age-related ailments.
While researchers are adding to the list of things that can shorten telomeres (smoking or infectious diseases, for example), they have also zeroed in on activities that seem to slow down telomere degradation. In a German study, people in their 40s and 50s had telomeres about 40% shorter than people in their 20s if they were sedentary, but only 10% shorter if they were dedicated runners. Telomeres are structures on the tips of all chromosomes, which gradually get shorter with age. Short telomeres are linked with premature aging and many diseases. By measuring telomere length, scientists can see how fast someone is aging and calculate their biological age. These data can be used to predict life expectancy. Scientists do not understand exactly how negative life experiences accelerate telomere erosion, or how positive behaviors stave it off. Additionally, outside of a few age-related diseases in which telomeres have been directly implicated, they are unable to say whether shorter telomeres cause aging or merely accompany it. But it is clear that the fates are not entirely in charge. According to the new science of telomeres, we can, to some extent, influence how much time we have.
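The shoelace-tip model of telomere erosion can be sketched numerically. The figures below (a telomere of roughly 10,000 base pairs at birth, about 70 bp lost per division, senescence near 5,000 bp) are rough literature ballparks of my choosing, not values from this chapter:

```python
def divisions_until_senescence(telomere_bp=10_000, loss_per_division_bp=70,
                               senescence_bp=5_000):
    """Count cell divisions until the telomere erodes below the threshold."""
    divisions = 0
    while telomere_bp - loss_per_division_bp >= senescence_bp:
        telomere_bp -= loss_per_division_bp
        divisions += 1
    return divisions

print(divisions_until_senescence())   # 71 divisions
# Faster per-division erosion (as linked above to stress or violence)
# exhausts the division budget sooner:
print(divisions_until_senescence(loss_per_division_bp=100))   # 50 divisions
```

The fixed division budget this toy model produces is the same intuition behind the Hayflick limit: the cell's clock runs out when the protective cap does.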
Time travel is within reach
In a new revelation, and one step on from fantastical thinking, a team from the University of British Columbia and the University of Maryland have said that time travel is possible, at least theoretically. Whether they are right or not, of course, only time will tell, but according to the scientists involved, there is no mathematical reason why a time travel machine could not disrupt the space-time continuum enough to go backward and forward in time.
The study, which was published in the journal Classical and Quantum Gravity, is titled "Traversable Acausal Retrograde Domains in Spacetime," which spells TARDIS, the name of the police-box time machine in the television series Doctor Who. What a coincidence! In the paper, the mathematicians involved in the project have proposed a mathematical model for a viable time machine. "In this paper we present geometry which has been designed to fit a layperson's description of a time machine," they wrote. "It is a box which allows those within it to travel backwards and forwards through time and space, as interpreted by an external observer." "People think of time travel as something fictional and we tend to think it's not possible because we don't actually do it. But, mathematically, it is possible," said Ben Tippett, one of the lead researchers. Any time travel machine would probably need to be able to warp space-time, which is the connection between time and physical dimensions such as width, depth, and height, and in the past, scientists have presented evidence suggesting that huge gravitational forces, such as those emitted by black holes (which we are taking a photo of soon; another crazy idea, maybe), can slow down time. The team's model is based on a similar idea: that a strong force could disrupt space-time and "bend time into a circle for passengers," which would, in theory at least, mean people could travel faster than the speed of light and therefore travel through time. While building a time machine may be theoretically possible, the researchers said it was unlikely that anyone would ever be able to achieve it. Do I hear another fantastical idea forming? "While it is mathematically feasible, it is not yet possible to build a space-time machine because we need materials, which we call exotic matter, to bend space-time in these impossible ways, but they have yet to be discovered," said Tippett.
So, as I watch people line up to book their first flights into space, I am going to be sure to get my name on the list to travel back in time. I have always wanted to see dinosaurs, and I do not mean the ones that scientists are trying to bring back from the dead.
Oh, you need to know something about DNA methylation (the octane catalyst)
Simplistically, we can say that methylation is a catalytic process that changes genes' expression (production) to promote aging and diseases such as cancer (https://phys.org/news/2013-05-universal-method-catalytic-methylation-amines.html). Recent studies have identified a group of individual methylation locations in the body where methylation changes the chronological aging genes. Methylation can also help reduce aging. Methylation is a very dynamic and regimented process that happens trillions of times in every cell each minute. It is one of the most crucial metabolic functions of the body and is dependent upon a variety of enzymes. Adapting to stress and the challenges of life is one benefit that methylation provides the body. Without adequate methylation processes, we cannot effectively harmonize our living and consequently will suffer the detrimental results of erratic aging. Methylation controls the product line of amino acids, enzymes, and DNA in every cell. It also regulates healing of the body's tissues, cell energy, genetic production of DNA, liver detoxification, immunity, and neurology. Methylation is like the jack of all trades. In addition, we have the telomere as an aging gauge, as shown in Fig. 14.2. Methylation is involved in almost every bodily biochemical reaction, and it occurs billions of times every second in our cells. That is why if we determine what methylation is doing during its work, and where, we will be able to guide it to offer its optimum, which will in turn have a positive impact on our health.
FIGURE 14.2 Age-associated cognitive gene decline, and how DNA is involved in fixing it. Expression changes and master switches driving age-related cognitive decline. Brain aging is accompanied by alterations in healthy neurons (left of the picture) leading to downregulation; the circles represent upregulation of genes belonging to multiple pathways. Epigenetic modifications, particularly a decrease in DNMT1 and an increase in HDAC2 levels, might be the master regulators, and accordingly, epigenetic modifiers might prove ideal therapeutic targets.
Here is the job description of methylation, listing all the functions, which are performed with equal priority:
1. Turns genes on and off (gene regulation); this is important in cancer, for example
2. Processes chemicals and toxins (biotransformation), helping to reduce our toxic load
3. Builds neurotransmitters (dopamine, serotonin, epinephrine)
4. Processes and metabolizes hormones (estrogen)
5. Builds immune cells (T cells and NK cells)
6. Synthesizes DNA and RNA (thymine is formed from uracil)
7. Produces energy (CoQ10, carnitine, and ATP)
8. Produces the protective coating on our nerves (via myelination)
The amazing storage phenomenon
Throughout human history, people have come up with all sorts of data storage systems, from cuneiform and chiseled inscriptions to hard drives and compact discs. But they all have one thing in common: at some point, they degrade, and tracks get corrupted or lost. We all know that technology gives, and technology takes. Data storage is not immune from aging. DNA storage is the new technology that promises advantages that seem time independent. A few drops of DNA would be enough to store all the world's music! This is hard to believe, but the science behind storing data in DNA has been proven. The songs "Tutu" by Miles Davis and "Smoke on the Water" by Deep Purple have already made their mark on music history. Now they have entered the annals of science, for eternity. Recordings of these two legendary songs were digitized by the École Polytechnique Fédérale de Lausanne as part of the Montreux Jazz Digital Project. Here is a link on the subject: https://actu.epfl.ch/news/two-items-of-anthology-now-stored-for-eternity-in-/.
Twist Bioscience (a synthetic DNA manufacturer) and its research partners (Microsoft and the University of Washington) encoded the recordings, transformed them into DNA strands, and then sequenced and decoded them and played them back without any loss in recording quality. This little project earned UNESCO heritage status and was preserved at the World Archive as DNA strands in a test tube to ensure its preservation for thousands of years. The amount of artificial DNA needed to record the two songs is invisible to the naked eye; the amount needed to record all 50 years of archives from the World Register would be the size of a grain of sand. Saving any kind of digital data (songs, speeches, pictures, videos) on DNA strands will last almost forever. Only time will tell, thousands of years from now, whether our archive rode on the back of this time machine and survived decay.
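The core idea of encoding binary data in DNA can be sketched in a few lines. This is my minimal illustration, not the actual Twist/Microsoft pipeline, which adds error-correcting codes, avoids long runs of a single base, and splits data across many short strands; it shows only the basic two-bits-per-base mapping:

```python
# Two bits per base: 00->A, 01->C, 10->G, 11->T (a common textbook mapping).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data):
    """Encode bytes as a DNA base string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand):
    """Decode a base string back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"Tutu")
print(strand)                     # CCCACTCCCTCACTCC
assert decode(strand) == b"Tutu"  # lossless round trip
```

At four bases per byte, any digital file, a song, a speech, a video, becomes a sequence of A, C, G, and T that a DNA synthesizer can write and a sequencer can later read back.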
Storage evolution over time
The migration from the old 1970 IBM 3420 tape drive to the solid-state drive is a monumental technological jump. Storage evolution is moving at breakneck speed. We can say that storage has moved from the rudimentary technology of the ice age into a much hotter climate. In the 1970s, IBM ruled the computing industry. People used to say, "What is good for IBM is good for the country." IBM has since lost its leading position to other US and foreign companies. And even though so many companies are currently selling the cloud, cloud storage alone will not solve the exponential growth of data. Fig. 14.3 shows technology's perennial behavior of producing new products.
DNA storage random access retrieval
February 2018 marked another huge step in the bioinformatics industry. Microsoft, the University of Washington, and Twist Bioscience published the first report of DNA-based random access memory (RAM) in Nature Biotechnology. In this study, researchers not only archived record-breaking volumes of data in DNA, but they also stored the data in a format that models your
FIGURE 14.3 This is the chronology of magnetic/silicon storage technology going through birth and death periods following an exponential distribution function f(x) = λe^(−λx). The length of the technology lifetime depends on how disruptive it is and the substantial improvement it provides (benefits/cost) over the preceding technology.
device's RAM for the first time. In other words, data access is dynamic, moving from one location to another by address. First-generation DNA storage used serial (byte-by-byte) recording: the entire pool of DNA blocks had to be read sequentially. This is similar to a tape player, where data are recorded back to back and we can only access the desired data in sequence. This is manageable if we are working with a short tape, but it does not scale to terabyte volumes. The study hypothesized that if archival DNA could be synthesized in a way that allows a decoder to treat it as RAM and not as a sequential storage device, single files could be accessed quickly without ever having to sequence the entire DNA pool. See Fig. 14.4 for the revolutionary indexing (direct access) process of DNA. The Twist Bioscience team encoded 200 MB of binary data in 35 files and stored it in 13.5 million unique, high-quality DNA sequences, all around 150 bp in length, each synthesized by Twist Bioscience, which leads the market in high-throughput, low-cost DNA synthesis. The team was then able to select individual files or groups of files, use PCR to amplify these files from the pool of DNA, and then use a MinION DNA sequencer to sequence the amplified files. The sequencing data were then decoded to recover every file, byte for byte, without error! A big step for humanity! The CEO of Twist Bioscience, Dr. Emily M. Leproust, eloquently said: "We are delighted to see the positive response and growing excitement over DNA as a solution to our world's growing digital storage
FIGURE 14.4 A joint venture between Microsoft, the University of Washington, and Twist Bioscience achieved a successful random-access indexing mechanism to sequence DNA code. This is a remarkable achievement that will boost the implementation of DNA storage. Each of the original DNA strands is chopped into three segments. Each segment has its own start/stop primer, and each segment carries an index with a pointer to its start primer. The stop primer of the first segment connects to the start primer of the following (or any other) segment.
dilemma. We have taken up the challenge of massively increasing DNA synthesis scale to accelerate adoption of DNA as the logical replacement for current legacy electronic and magnetic storage technologies. We are thrilled to continue our work with Microsoft and the University of Washington researchers to drive this technology forward."
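The primer-based random access described above can be sketched as a toy simulation. This is my illustration, not the actual Microsoft/UW/Twist pipeline: a string tag stands in for the physical index and primer pair, and filtering the pool stands in for selective PCR amplification, so one file is recovered without reading everything:

```python
def store(pool, file_id, payload_chunks):
    """Append tagged strands for one file to the shared DNA pool."""
    for position, chunk in enumerate(payload_chunks):
        # tag|position|payload stands in for the index + primer pair
        pool.append(f"{file_id}|{position}|{chunk}")

def retrieve(pool, file_id):
    """Recover one file by tag match, reassembling its chunks in order."""
    hits = [s.split("|") for s in pool if s.startswith(file_id + "|")]
    hits.sort(key=lambda parts: int(parts[1]))   # strands come back unordered
    return "".join(parts[2] for parts in hits)

pool = []
store(pool, "file01", ["GATT", "ACAG", "ATTA"])
store(pool, "file02", ["CCGG", "TTAA"])
print(retrieve(pool, "file01"))   # GATTACAGATTA
```

The sort step matters: like strands amplified out of a real DNA pool, the chunks arrive in no particular order, and the embedded position index is what lets the decoder reassemble the file byte for byte.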
Appendices
Appendix 14.A Svalbard Global Data Vault
The Svalbard Seed Vault is situated on the Norwegian island of Spitsbergen, 1300 km beyond the Arctic Circle. The vault is built to contain 4.5 million different seed samples from all over the world, ensuring that biodiverse plant life can be restored even in the event of a great natural disaster or man-made catastrophe. The vault is also well suited to store DNA data tubes. It reminds us of Noah's Ark. Fig. 14.5 shows the geographic location of Svalbard, right in the middle of the Arctic Ocean; access is restricted to residents and staff of the vault. Fig. 14.6 shows how the strange vault is implanted into the mountain. Fig. 14.7 shows the inside of the vault, where massive racks are set up to store the seeds and the DNA tubes.
FIGURE 14.5 The vault is located on Svalbard, Norway, for security against both cyber and physical threats. Geologically, the structure is very stable and well insulated.
FIGURE 14.6 This is the entrance to the vault, which is 150 m deep in the mountain. A sharp concrete edge cutting its way out from within the mountain, with astonishing architecture. It is also covered with highly reflective stainless steel, mirrors, and prisms, all reflecting light in every direction. Courtesy of https://www.croptrust.org/our-work/svalbard-global-seed-vault.
FIGURE 14.7 Although this vault was built to preserve food, it will also be used to store DNA tubes, because this place is built sturdier than any castle. All the boxes are systematically indexed and sorted, and the vault computer can locate any box instantly. The place is one of the world's wonders. Courtesy of https://www.croptrust.org/our-work/svalbard-global-seed-vault.
The vault defies time, and its contents could be preserved for 15,000 years. All engineering and computer science technologies have been applied in the vault. The remoteness of the vault grants considerable protection as well. The vault is intelligently secured with artificial intelligence technology against most threats, even extremely dire ones such as asteroids and nuclear bombs. The temperature will not exceed −4°C. The vault has its own auxiliary power supply in case of normal power failure. The vault can store a total of 4.5 million seed samples, where each sample is a bag of around 500 seeds stored in an airtight aluminum bag. DNA data banks will also be stored in the vault, protected against cyber threats, with account privacy and permanent preservation. A virtual private network and a demilitarized zone are implemented to protect the medical, government, and private sectors. Two-factor authentication is deployed, as well as authorization checkpoints, and HTTPS provides an encrypted online session. This marvelous wonder was built by visionary people. The Svalbard Global Seed Vault was built to help us preserve our civilization and the technologies of the 20th century. One thousand years from now, people will decode DNA data back to binary format and observe our world of 2020: listen to Elton John's "Madman Across the Water" or Miles Davis's "Kind of Blue," visit the Capitol, or compare our Teslas to their driverless jets.
Appendix 14.B Glossary (Courtesy http://www.genesinlife.org/glossary)
Abnormal result: A possible result of a screening test. An abnormal result does not determine a diagnosis, which means additional testing is needed to see if the individual has a condition. Also referred to as a positive result.
Acquired mutations: A change within a sequence of DNA caused by environmental factors (sun, radiation, or chemicals), aging, or chance.
Acute: Describes an illness that only affects an individual for a short period of time.
ADA: The Americans with Disabilities Act of 1990 gives civil rights protections to individuals with disabilities and guarantees equal opportunity for individuals with disabilities in public accommodations, employment, transportation, state and local government services, and telecommunications.
Adenine: One of four chemical bases in DNA, denoted (A), with the other three being cytosine (C), guanine (G), and thymine (T).
Advocacy group: A group of people who work together to support a cause.
Alkaptonuria: A rare genetic disorder in which a person's urine turns a dark brownish-black color when exposed to air.
Allele: One of two or more versions of a gene. An individual inherits two alleles for each gene, one from each parent.
Amino acids: Amino acids are a set of 20 different molecules used to build proteins.
Annotation: The process of identifying the locations of genes in a genome and determining the role of those genes. Annotation is a process that often follows gene sequencing.
Autosomal chromosomes: All of the human chromosomes except for the X and Y chromosomes.
Autosomal dominant: A pattern of inheritance where having only one copy of the gene that does not work correctly results in the condition, and the condition affects males and females equally.
Autosomal recessive: A pattern of inheritance where two copies of the gene that do not work properly are needed to have the condition, and the condition affects males and females equally.
Biobank: A collection of human biological samples (such as blood and tissue) and medical information about the people who gave their samples for research studies.
Biomedicine: When the principles of natural sciences are used to evaluate and treat medical conditions.
Biorepository: A collection of human biological samples (such as blood and tissue) and medical information about the people who gave their samples for research studies.
Blinding: In a scientific experiment, a blind is where some of the people involved are prevented from knowing certain information that might lead to conscious or subconscious bias on their part, making the results not completely accurate.
Blood sample: When blood is drawn from the human body to be tested for medical purposes.
BRCA 1 and 2: The first two genes found to be associated with inherited forms of breast cancer.
Cancer: A group of diseases characterized by uncontrolled cell growth. Cancer begins when a single cell mutates, resulting in a breakdown of the normal regulatory controls that keep cell division in check.
Carrier: A person who has a change in only one gene of a pair, and the other gene of the pair is working normally. Carriers typically do not display the symptoms of the condition but can pass on the mutation to offspring.
Carrier screening: A type of genetic testing to determine if an individual is a carrier for a genetic disease.
Cell: The basic building block of all living things.
Chromosome: An organized structure of DNA containing many genes that is wrapped around proteins found in cells. Humans typically have 23 pairs of chromosomes, or 46 total.
Chronic: Describes an illness that affects an individual for a long period of time, possibly their entire life.
Chronic disease: A long-lasting health condition such as cancer, coronary heart disease, and diabetes.
CLIA: Clinical Laboratory Improvement Amendments are regulations created in 1988 by the Center for Medicare and Medicaid Services to ensure quality laboratory testing on humans.
Clinical geneticist: A physician with training in genetics who meets with patients to evaluate, diagnose, and manage genetic disorders.
Clinical testing: Testing that is done to confirm if a person has a condition.
Cloning: Creating an organism that has the same genes as the original.
Coinsurance: Your share of the costs of a covered healthcare service, calculated as a percentage of the allowed amount for the service.
Confirmatory test: A test to confirm or rule out a medical condition in an individual with concerning symptoms or an out-of-range screening result.
Congenital: A condition that is present from birth.
Copay: A fixed amount you pay for a covered healthcare service, usually when you get the service. The amount can vary by the type of covered healthcare service.
Copy number variation: When the number of copies of a particular gene varies from one individual to the next.
Cytogenetics: The branch of genetics that studies the number and structure of human chromosomes.
Cytosine: One of four chemical bases in DNA, denoted (C), with the other three being adenine (A), guanine (G), and thymine (T).
Deductible: The amount you owe for covered healthcare services before your health insurance plan begins to pay.
Deoxyribonucleic acid: DNA, a molecule found in chromosomes that carries genetic information. DNA is composed of four units (called bases) that are designated A, T, G, and C.
Diagnostic genetic testing: Genetic testing used to identify if an individual has a condition associated with symptoms they are showing.
Direct-to-consumer genetic testing: A type of genetic testing that is available directly to the consumer without having to go through a healthcare professional.
Disorder: A disturbance in physical or mental health functions.
DNA replication: The process by which a molecule of DNA is duplicated.
DNA sequence: The sequence of the bases of DNA spells out the instructions for making all of the proteins needed by an organism.
Dolly: The first mammal ever cloned (a sheep).
Dominant: Individuals receive one version of a gene from each parent. Sometimes, a version of a gene is dominant. Dominant genes have a more powerful effect than recessive genes and are thus more likely to be expressed or have a visible effect on the body. If a dominant gene and a recessive gene are inherited, the effects of the dominant gene will mask those of the recessive gene.
Double helix: The twisted-ladder shape that two strands of DNA form.
Down syndrome: Also called trisomy 21. Down syndrome is a genetic disease in which a person inherits an extra copy of chromosome 21. The extra chromosome causes problems with the way the body and brain develop.
Dried-blood-spot testing: Testing the small amount of dried blood on the filter paper cards used in newborn screening.
DTC: Direct-to-consumer; a type of genetic test that is available directly to the consumer without having to go through a healthcare professional.
Chapter 14 DNA is a time storage machine for 10,000 years
Emergency preparedness: The act of being prepared with your medical information in case an emergency event ever occurs. Endocrinologist: A doctor who specializes in disorders of the glands. Environmental factors: Chemicals, sun, and radiation that can cause mutations in DNA and can result in a disease that is acquired rather than inherited. Enzyme: A protein that helps with chemical reactions in the body. Epigenetic markings: Changes in how the expression of a gene is regulated that are not caused by a change in the gene sequence. Epigenetics: An emerging field of science that studies heritable changes caused by the activation and deactivation of genes without any change in the underlying DNA sequence of the organism. Epigenome: The set of chemical compounds that modify, or mark, the genome in a way that tells it what to do, where to do it, and when to do it. Eukaryote: A type of cell that has a nucleus and membrane-bound organelles. Exon: A section of DNA that serves as the set of instructions for constructing a protein. Expression: The process by which the information from a gene is translated and used to make a functional product, either a protein or a strand of RNA. When this occurs, the gene is said to have been expressed. False negative result: A result in which a diagnostic test comes back normal even though the disease is actually present. Tests are designed to make sure this type of mistake happens as little as possible. False positive result: A result in which a diagnostic test comes back positive or abnormal even though the disease is not actually present. Tests are designed to make sure this type of mistake happens as little as possible. Family health history: A record of medical information about an individual and his or her family members, as well as information about the eating habits, activities, and environments the family shares. Family tree: A record of members of a family and their relationships. Fatal: Causing death.
FDA: The Food and Drug Administration is the governmental organization responsible for protecting the public by assuring the safety, efficacy, and security of drugs, biological products, medical devices, food, and cosmetics. First-degree relative: A family member who shares about 50% of their genes with a particular individual in a family. First-degree relatives include parents, offspring, and siblings. Follow-up testing: A testing procedure that takes place after a positive or abnormal test result. Follow-up testing is designed to limit false-positive results. Fragile X syndrome: A genetic disorder caused by mutations in a gene on the X chromosome. Fragile X syndrome affects mostly males. It is the most common form of inherited intellectual disability (mental retardation). Other symptoms include distinctive facial features and poor muscle tone. Fraternal twins: Result from the fertilization of two separate eggs during the same pregnancy. Fraternal twins may be of the same or different sexes, and they share half of their genes just like any other siblings. Frequency: The number of times something happens in a specific group. Gene: A sequence of DNA that carries the instructions for making a sequence of RNA, which in turn instructs and assists in the creation of proteins.
Appendices
Gene regulation: The process of turning genes on and off, which ensures that the appropriate genes are expressed at the proper times. Gene therapy: An experimental technique for treating disease that works by introducing a healthy copy of a nonfunctioning gene into the patient's cells. Genetic code: The instructions in a gene that tell the cell how to make a specific protein. A, C, G, and T are the letters of the DNA code. Genetic counselor: A healthcare provider who has special training in genetic conditions. Genetic counselors help families understand genetic disorders and counsel families in making decisions about the testing or management of a genetic disorder. Genetic disease: A condition that is caused by changes in genes or chromosomes. Genetic disorder: A disease that is caused by an abnormality in an individual's DNA. Genetic map: Shows where genes are located relative to each other on chromosomes. Genetic marker: A DNA sequence with a known physical location on a chromosome. Genetic markers can help link an inherited disease with the responsible gene. Genetic testing: A laboratory test to look for a change in a gene of an individual. The results of a genetic test can be used to confirm or rule out a diagnosis of a genetic disease. Geneticist: A doctor or scientist who studies heredity and how genes work and contribute to disease. Genetics: The scientific study of how particular qualities or traits are passed down from parents to children. Genome: The complete DNA sequence in the chromosomes of an individual. Genomics: The study of the entire genome of an organism, whereas genetics refers to the study of a particular gene. Genotype: The particular set of genes, or versions of a gene, that an individual inherits. GINA: The Genetic Information Nondiscrimination Act, federal legislation passed in 2008 that makes it unlawful to discriminate against individuals based on their genes for health insurance or employment purposes.
Guanine: One of four chemical bases in DNA, denoted (G), with the other three being adenine (A), cytosine (C), and thymine (T). Gynecologist: A doctor who specializes in the healthcare of women. Health insurance marketplace: A resource where individuals, families, and small businesses can: learn about their health coverage options; compare health insurance plans based on costs, benefits, and other important features; choose a plan; and enroll in coverage. Healthcare provider: A doctor, nurse, physician's assistant, or genetic counselor. Hereditary mutation: A change within a gene that can be passed to offspring. Heredity: The passing of traits from parents to offspring. Heterozygous: Refers to having inherited different forms of a particular gene from each parent. Histone: Proteins that DNA wraps around as it coils into chromosomes. These proteins keep the DNA from becoming tangled and damaged. Homozygous: A genetic condition where an individual inherits the same alleles for a particular gene from both parents. Human Genome Project: An international project that mapped and sequenced the entire human genome. Human subjects protections: Government policies to protect people who participate in genetics research.
Huntington's disease: A genetic disorder that affects muscle coordination and brain function. Nerve cells in certain parts of the brain waste away, or degenerate, so it is called a neurodegenerative disorder. It is inherited through an autosomal dominant pattern. Identical twins: Result from the fertilization of a single egg that splits in two. Identical twins share all of their genes and are always of the same sex. Immunity: An inherited, acquired, or produced resistance to infection by a specific pathogen. Immunization: The process of producing immunity to an infectious organism or agent in an individual or animal through vaccination. Immunologist: A doctor who specializes in conditions of the immune system. In-range screening result: A result indicating that the clinical test did not show any signs of the conditions screened for. Informed consent: Permission given by an individual to proceed with a specific test, procedure, or research study with an understanding of the risks and benefits of the activity. Inheritance: Passing of genes and traits from parents to children. Institutional Review Board (IRB): The IRB makes sure that risks to people are as low as possible in a research study. Isolate DNA: To separate DNA from the other cell components. Karyotype: Refers to an individual's full set of chromosomes. May also refer to a photographic representation of an individual's chromosomes with all 23 pairs positioned next to one another. Knockout: Refers to an organism that has been genetically engineered such that one or more specific genes are inactivated or do not work properly. Scientists create knockouts (often in mice) so that they can study the impact of these genes when they do not function and learn something about the genes' function. Linkage: The close association of genes or other DNA sequences on the same chromosome. The closer two genes are to each other on the chromosome, the greater the probability that they will be inherited together.
Medicaid: An insurance program paid for by the government for low-income families, pregnant women, and people with disabilities. Medical geneticist: A doctor who specializes in genetics and genetic disorders. Medical home: The facility or physician who coordinates the care of an individual with a complex medical condition. Medicare: An insurance program paid for by the government for people over 65 years of age or people with end-stage renal disease. Mendelian inheritance: Refers to patterns of inheritance that are characteristic of organisms that reproduce sexually. Metabolic disorder: A disorder or defect in the way the body breaks down food or other products (metabolism). Methylation: When a base on the DNA strand is altered by the addition of a methyl group. This change causes that section of DNA to coil more tightly, preventing the genes around it from being used or expressed. This process is important as embryos develop and new cells take on specific roles in the body, but errors in DNA methylation have been linked to many human diseases. Mitochondria: Membrane-bound cell organelles that generate most of the chemical energy needed to power the cell's biochemical reactions.
Mitochondrial inheritance: The mitochondrion, an organelle in the cell, contains its own genome. Mutations in these genes are responsible for several known genetic diseases. Individuals only inherit mitochondrial DNA from their mothers. Mitosis: The process occurring in cells where all the chromosomes are replicated and the cell contents are equally divided into two daughter cells. Model organisms: Organisms used in medical research to mimic a disease found in humans and to study its prevention, diagnosis, and treatment. Mutation: Any change that occurs in a gene. These may occur because of errors in the replication process or directly from the environment. Most mutations do not have any effect, some may have positive effects, and many have harmful effects. Natural selection: The evolutionary process where the organism best adapted to its environment survives. Negative test result: A possible result of a screening or diagnostic test. If the result came back from a genetic screening test, it means that the test did not find any evidence of the genetic condition for which it was testing. If the result came back from a genetic diagnostic test, then the test did not find any evidence that the person has the genetic condition for which it was testing. Neonatal: During the first month of life. Newborn screening: A process of testing newborn babies for some serious, but treatable, conditions. Noninvasive: A medical test or procedure that does not require a doctor to insert any device through the skin or into a body opening. Nucleus: A membrane-bounded region inside each cell that provides a sanctuary for genetic information, including the long strands of DNA that encode this genetic information. Oncogene: A mutated gene that contributes to the development of a cancer. Opt-out: A patient’s right to refuse screening tests. Organ: A collection of tissues that structurally form a functional unit specialized to perform a particular function. 
Your heart, kidneys, and lungs are examples of organs. Organelle: A subcellular structure that has one or more specific jobs to perform in the cell, much like an organ does in the body. Out-of-pocket costs: Your expenses for medical care that are not reimbursed by insurance. Out-of-pocket costs include deductibles, coinsurance, and copayments for covered services plus all costs for services that are not covered. Out-of-range result: This result means that the screening test did show signs that the individual may be at higher risk of having one or more conditions. Patient confidentiality: The right of an individual patient to have personal, identifiable medical information kept private. Pediatrician: A primary care physician who specializes in the medical care of infants, children, and adolescents. Pedigree: A genetic representation of a family tree that diagrams the inheritance of a trait or disease through several generations. Pharmacogenetics: The study of how genetics determines drug behavior and why some drugs work differently between individuals. Phenotype: An individual's observable characteristics or traits.
Physician: A person licensed to practice medicine, also known as a medical doctor. Positive screen (positive test result): A possible result of a screening or diagnostic test. If the result came back from a genetic screening test, further testing must be done to determine if the person has the condition that was being tested for. If the result came back from a genetic diagnostic test, then the person has the condition and can pursue treatment options for that condition. Predictive genetic test: A genetic test for individuals who are not yet showing symptoms of a genetic disorder but have a family history of the condition or an increased risk of developing the condition. Predisposition: Symptoms are likely, but not certain, to develop if testing suggests you have the disease gene. Premium: The set dollar amount you pay each month to receive insurance coverage. Prenatal: Any time before the birth of the baby. Prenatal care providers: Healthcare professionals who aid a woman throughout her pregnancy. Presymptomatic: You will eventually develop symptoms if testing suggests you have the disease gene. Primary care provider: A doctor trained to treat a wide variety of health-related problems. Privacy protections: Ensure that blood spots cannot be accessed by a third party, including insurers and law enforcement. Progeria: A rare disease characterized by accelerated aging. Prokaryote: A type of cell that does not have a nucleus or membrane-bound organelles. Prostate cancer: A disease characterized by uncontrolled cell growth in the prostate gland, which is part of the male reproductive system. Protein: Proteins make up many parts of every cell in the body. Proteins are made up of amino acids. The order of these amino acids determines what form and job a protein has. Protein sequencing: The process of determining the order of amino acids (the molecules that make up proteins) of a particular protein.
Public health: The science and practice of protecting and improving the health of a community. Pulmonologist: A doctor who specializes in lung conditions and diseases. Quality assurance: The process of defining the quality of performance required for each step in the testing process. Quality control: Monitoring the degree of adherence to defined criteria, taking corrective action when the system fails, and documenting all of these events to convey the total quality of performance. Rare health condition: An uncommon disorder that affects the ability of the human body to function normally. Recessive: A quality found in the relationship between two versions of a gene. Individuals receive one version of a gene, called an allele, from each parent. If the alleles are different, the dominant allele will be expressed, while the effect of the other allele, called recessive, is masked. In the case of a recessive genetic disorder, an individual must inherit two copies of the mutated allele for the disease to be present. Referral: When a healthcare provider directs a patient to a specialist or another provider for further evaluation or care. Registry: A collection of medical information, clinical data, and demographics (age, male or female, etc.) about people with a specific disease or condition.
Research geneticist: Geneticists who focus on research and study the origin, treatment, and prevention of genetic conditions. Retesting: When a test needs to be repeated to clarify, confirm, or reject the results of the initial test. Ribonucleic acid: RNA, a molecule similar to DNA. Unlike DNA, RNA is single stranded. An RNA strand has a backbone made of alternating sugar (ribose) and phosphate groups. Attached to each sugar is one of four bases: adenine (A), uracil (U), cytosine (C), or guanine (G). SCID: Severe combined immunodeficiency is an inherited condition affecting the immune system, causing individuals to be more susceptible to infectious diseases. Screening tests: Tests that analyze DNA samples to detect the presence of a gene or genes associated with an inherited disorder. Sequencing: DNA sequencing is a detailed description of the order of the chemical building blocks, or bases, in a given stretch of DNA. The sequence of bases tells scientists the type of genetic information that is carried in a particular segment of DNA. Sex chromosome: A sex chromosome is a type of chromosome that participates in sex determination. Humans have two sex chromosomes, the X and the Y. Females have two X chromosomes in their cells, whereas males have an X and a Y chromosome in their cells. Sex linked: A trait in which a gene is located on a sex chromosome. Sickle cell anemia: A disorder that is passed down through families and causes red blood cells to form an abnormal crescent, or sickle, shape. These sickled red blood cells cannot carry enough oxygen to the body. It is inherited in the autosomal recessive pattern. Single nucleotide polymorphisms (SNPs): A type of polymorphism involving variation of a single base pair. Scientists are studying how single nucleotide polymorphisms, or SNPs (pronounced snips), in the human genome correlate with disease, drug response, and other phenotypes. Social worker: A trained professional who provides social services to those in need.
Somatic cells: Any cell in the body except for sperm and egg cells. Specialist: A healthcare provider who has special knowledge about a condition or a specific part of a condition. Standard medical procedure: Surgery or practice that is common and well accepted as the best course of treatment. State assistance: Payment given to individuals by government agencies on the basis of need. Stem cell: A cell with the potential to form many of the different cell types found in the body. Support group: A group of people who are all impacted by the same condition and come together to share experiences and help one another. Symptom: Evidence of a disorder or disease that directly affects and is noticed by the patient, such as a rash, pain, nausea, or a runny nose. Testing outcomes: The possible results you can receive after participating in a test such as positive, negative, or inconclusive. Thymine: One of four chemical bases in DNA, denoted (T), with the other three being adenine (A), cytosine (C), and guanine (G). Trait: A specific characteristic of an individual. Transcription: The process within the cell that uses DNA instructions to create pieces of RNA, which can then be used to make proteins or perform other tasks throughout the body.
Transcription factor: Proteins that bind to specific sections of DNA and control transcription or the process of using DNA instructions to create new strands of RNA. Transition process: The time when an individual with a genetic condition or special healthcare needs must change his or her system of care to reflect his or her age. After reaching adolescence or adulthood, an individual will likely need to change healthcare providers, most likely from a pediatrician to an adult physician. Treatable condition: A condition with a known treatment that can improve the survival and/or quality of life of an individual. True positive result: A small percentage of individuals with out-of-range results do have the condition and must pursue treatment options. Uracil: One of four chemical bases that are part of RNA, denoted (U). The other three bases are adenine (A), cytosine (C), and guanine (G). Virus: An infectious agent that occupies a place near the boundary between the living and the nonliving. Viruses enter host cells and hijack the enzymes and materials of the host cells to make more copies of themselves. Whole genome sequencing: Whole genome sequencing is the mapping out of a person’s unique DNA. Your genome is the unique blueprint for your body. Working copy: A gene that functions the way it is intended to. X chromosome: One of two sex chromosomes. Humans have two sex chromosomes; the X and the Y. Females have two X chromosomes in their cells. Males have X and Y chromosomes in their cells. X-linked dominant: An inheritance pattern where X-linked means that the disease gene is located on the X sex chromosome and dominant means that having only one copy of the gene that does not work properly causes the condition. Affects more females than males. 
X-linked recessive: An inheritance pattern where X-linked means that the disease gene is located on the X sex chromosome and recessive means that two copies of the gene that does not work properly are needed to have the condition. Affects more males than females. Y chromosome: One of two sex chromosomes. Humans have two sex chromosomes, the X and the Y. Males have X and Y chromosomes in their cells.
Suggested readings
Brigham and Women's Research: www.brighamandwomens.org/about-bwh/newsroom/press-releases?SearchTerm=&pageSize=10&newsType=pressreleases&category=pressrelease&pageDetailUrl=%2Fabout-bwh%2Fnewsroom%2Fpress-releases-detail&year=0&pageNum=1.
Child Abuse and Neglect: https://www.nap.edu/read/18331/chapter/1.
DNA Records Indexing: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2696921/.
DNA Methylation: https://www.cell.com/molecular-cell/pdf/S1097-2765(18)30642-7.pdf.
DNA and Mortality: www.google.com/search?rlz=1C1SQJL_enUS819US822&tbm=isch&q=DNA+and+mortality&chips=q:dna+and+mortality,online_chips:clock&usg=AI4_-kToQREc6yFRiwB0w4MoTRu_Q8OjHw&sa=X&ved=0ahUKEwjN2rHYvP3gAhVJnZ4KHQzVAm8Q4lYIKygC&biw=1536&bih=722&dpr=2.5#imgrc=nEJTe-foWDtx6M:.
Duke University research about children: https://www.researchgate.net/scientific-contributions/39035243_Kenneth_A_Dodge.
DNA Longevity: https://www.smithsonianmag.com/science-nature/can-your-genes-predict-when-you-will-die164511528/#c9hr8cR84RLFMk2I.99.
Fowler, C., September 6, 2016. Seeds on Ice: Svalbard and the Global Seed Vault. Prospecta Press.
Collins, F.S., December 16, 2009. The Language of Life: DNA and the Revolution in Personalized Medicine, first ed. HarperCollins e-books.
Giberson, K., March 15, 2011. The Language of Science and Faith: Straight Answers to Genuine Questions. IVP Books.
Jurga, S., Barciszewski, J., August 28, 2019. The DNA, RNA, and Histone Methylomes (RNA Technologies), first ed. Springer.
Triple Helix Online: http://triplehelixblog.com/2017/02/living-by-the-clock-how-dna-methylation-affects-lifespan/.
Predicting Death with DNA: https://www.independent.co.uk/news/science/test-that-can-predict-death-with-a-terrifying-degree-of-accuracy-8329398.html.
What is Epigenetics and Methylation: https://www.whatisepigenetics.com/what-is-epigenetics/.
CHAPTER 15: DNA and religion
In my view, all that is necessary for faith is the belief that by doing our best we shall succeed in our aims: the improvement of mankind. Science and everyday life cannot and should not be separated. – Rosalind Franklin, famous DNA biologist.
I am against religion because it teaches us to be satisfied with not understanding the world. – Richard Dawkins.
I cannot imagine a God who rewards and punishes the objects of his creation, whose purposes are modeled after our own; a God, in short, who is but a reflection of human frailty. – Albert Einstein.
There is a fundamental difference between religion, which is based on authority, [and] science, which is based on observation and reason. Science will win, because it works. – Stephen Hawking, interview with Diane Sawyer/ABC News, June 2010.
It is in Apple's DNA that technology alone is not enough; it's technology married with liberal arts, married with the humanities, that yields us the results that make our heart sing. – Steve Jobs.
I believe in a life after death, quite simply because energy cannot die; it circulates, is transformed, and never stops. – Albert Einstein (translated from the French).
This great work, ever more marvelous the more it is known, gives us so great an idea of its maker that we feel our minds overwhelmed with admiration and respect. – Bernard le Bovier de Fontenelle (Feb. 11, 1657 – Jan. 9, 1757) (translated from the French).
As knowledge advances, science ceases to scoff at religion; and religion ceases to frown on science. The hour of mockery by the one, and of reproof by the other, is passing away. Henceforth, they will dwell together in unity and goodwill. They will mutually illustrate the wisdom, power, and grace of God. Science will adorn and enrich religion; and religion will ennoble and sanctify science. – Oliver Wendell Holmes Jr. (Associate Justice of the Supreme Court of the United States from 1902 to 1932).
Storing Digital Binary Data in Cellular DNA. https://doi.org/10.1016/B978-0-12-823295-8.00015-X Copyright © 2020 Elsevier Inc. All rights reserved.
DNA and religion galaxies are intersecting
What is religion? How can religion be defined? Ever since the world began, humans have demonstrated a natural inclination toward faith and worship of anything they considered superior or difficult to understand. Even today, religion has caused more global tumult than all the A-bombs of 5000 Hiroshimas. Here is a realistic description of what religion is all about: it is a trip that you take on a small boat, by yourself with two oars, heading toward an unknown site far away, equipped with instructions, a strong will, and a small map to help you get there. You may get lost and end up in the middle of the ocean. Only your fortitude and good teaching to stay the course will lead you to land.
We all ride our personal boat
We all ride our personal boat, no extra riders, and head for the desired island. We all travel at the same constant speed. Some boats have different designs and colors. Riders also differ in looks, color, age, and purpose. Riders start at different times and from different origins. But we are all heading for the same island. Some drown, some get lost, some change their minds and head back home, some fight with other boats, and some join one another as a team. Some people have no interest in going to the island at all. The people who arrive on the island keep their secrets and are not able to communicate with the people back home. This is what religion is all about. Democracy insists that all religions share a common denominator: a moral code governing the conduct of human affairs, including devotional and ritual observances. In an age when engineering advancements are producing remarkable technologies, one is amazed at the power of human innovation. However, as astonishing as these technical advancements are, one should not forget the unshakable truth of the holy books, which affirm that despite the wondrous capability of humans to engineer, all these inventions pale in significance compared with the ultimate creator.
The blessing of bacteria
The field of biotechnology is built upon this very principle: the greatest innovations are inherent within the natural synthesis of ideas, and it is only a matter of identifying and harnessing the power of organisms for humans to use. The challenge is in finding a system that can be perfectly employed. The complexities found even in microorganisms, such as bacteria, are so extensive that it was only in the 20th century that scientists began to understand how beneficial they could be to humans. It was in 1928 that scientists discovered penicillin, the first antibiotic. Antibiotics are chemicals produced by bacteria and fungi to fight off other microorganisms. Their discovery transformed medicine and led to the rapid reduction of infectious diseases across the world, saving millions of lives. Later, scientists discovered that, due to the relative simplicity of these microorganisms in comparison with animals and plants, bacterial genes could be easily modified or changed in a process known as recombination. This discovery was not only instrumental in understanding how DNA works, but it also opened the doors to the field of genetic engineering. Today, many medical treatments are based upon such technology.
For example, patients suffering from diabetes have very high levels of sugar in their blood because they are deficient in the hormone insulin. To treat diabetes, patients are given injections of insulin. Have you ever considered where insulin is made? Instead of going through the complexities of synthesizing insulin, scientists simply take the insulin gene and recombine it into bacterial DNA. These bacteria then make large quantities of the hormone, which can be collected. Hence, bacteria can act as remarkable genetic factories, with the ability to churn out whichever protein is inserted into their DNA. Many other drugs are produced in a similar way, by genetically modifying bacteria.
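The splice-a-gene-into-bacterial-DNA idea described above can be pictured as a toy string operation. The sketch below is illustrative Python only, not real molecular biology: the plasmid and gene sequences are invented stand-ins (only the GAATTC recognition motif is a real restriction-enzyme site, EcoRI's), and actual cloning relies on restriction enzymes and ligases rather than string slicing.

```python
# Toy illustration of recombination: splice a gene of interest into a
# bacterial plasmid at a recognition site. Sequences are invented.

def recombine(plasmid: str, gene: str, site: str) -> str:
    """Insert `gene` into `plasmid` immediately after the first `site`."""
    cut = plasmid.find(site)
    if cut == -1:
        raise ValueError("recognition site not found in plasmid")
    cut += len(site)                       # cut just past the recognition site
    return plasmid[:cut] + gene + plasmid[cut:]

plasmid = "ATGGAATTCCGATAG"                # invented plasmid with an EcoRI-like site
insulin_gene = "TTTAAACCC"                 # invented stand-in for the insulin gene
print(recombine(plasmid, insulin_gene, site="GAATTC"))
# -> ATGGAATTCTTTAAACCCCGATAG
```

Once the gene sits inside the plasmid, the bacterium's own machinery expresses it, which is why engineered bacteria can "churn out" insulin in bulk.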
Peace between science and religion: injecting "scriptural" DNA into the body
Some people take Bible studies very seriously. But injecting oneself in the leg with bits of lab-made junk DNA based on verses from the Old Testament and the Holy Quran takes things to a whole new level. Happily, Adrien Locatelli says that 5 months after the event, he has suffered no ill effects from his biohack; none that he has noticed, at least. Locatelli, a 16-year-old high school student from Grenoble, France (https://www.livescience.com), unveiled his achievement in an academic-style paper published on the Open Science Framework (OSF), a site devoted to "open science." The paper was titled, no doubt accurately, "The first injection in a human being of macromolecules whose primary structure was developed from a religious text." Injecting oneself with artificial DNA is asking for trouble. But 5 months after his exploits, Locatelli reports that he feels "very good" emotionally and physically. "Injections of junk DNA and junk proteins are not dangerous," he says, explaining his thinking to the Haaretz newspaper. As Locatelli eloquently clarifies in his paper: "I am the only subject of this study and I took my chances." No gene therapist would bet his microscope on that, but Locatelli feels that the probability of harm from injecting junk DNA and junk proteins "is almost nil." In any case, he only injected himself with tiny amounts. The worst effect he suffered (so far, at least) was discomfort at one of the two injection sites, like a mosquito bite, he says, adding, "I know why: It's because, to save money, I ordered a product of poor purity containing 'remains' of E. coli, like in a vaccine. I vaccinated myself against E. coli."
The magic of CRISPR
Just imagine a blind man getting back his vision, or someone with shaky hands (commonly referred to as a hand tremor) who stops shaking, or a young person crippled with muscular dystrophy (MD) who gradually regains his muscles and can throw away his leg braces. A new medical invention goes deep down into DNA and eliminates the neurological or degenerative mutation. The patient is healed and is back in the orbit of normal life. DNA testing is used today in many different fields. Whether for scientific research, medical research, or criminal investigations, its common use is causing much controversy, with today's scientists believing that DNA tests can now reveal what human beings are made of and how they were created. Over decades, the field of genetic engineering has progressed steadily and has led to the production of genetically modified plants, insects, and even animals such as mice, which are used to study models of human diseases. However, since these animals are more complex than bacteria, their
Chapter 15 DNA and religion
production has proven difficult, time-consuming, and expensive. This has limited production to those scientific or industrial labs that house special facilities and have ample resources. Today, however, we are amid a revolution that is making gene editing a simple and routine process, one that is being adopted in laboratories across the world. While there are multiple new gene-editing techniques being developed, the most popular one is known as CRISPR (pronounced “crisper,” short for Clustered Regularly Interspaced Short Palindromic Repeats). CRISPR-Cas (“Cas” stands for CRISPR-associated protein) was originally discovered in the 1980s in bacteria, but its function was not understood at the time. In the early 2000s, researchers found that CRISPR serves as an immune defense system in bacteria, and only recently has its potential for gene editing been explored. How does CRISPR work, and how is it used for gene manipulation?
DNA and bacteria are allies Bacteria, like all organisms, face a real threat from invasive viruses. However, bacteria are vulnerable: they are single celled and so do not have any special cells dedicated to recognizing and fighting off invaders, unlike animals, which have white blood cells and other immune cells. One of the ways bacteria defend themselves against viruses is by keeping short copies of different viral genes within their own DNA, known as CRISPR genes. The CRISPR genes are always on surveillance duty. When a bacterium is attacked by a virus, the genes match up to the viral DNA that has entered the bacterium. The Cas (CRISPR-associated) protein is then able to cut the foreign DNA at the specific sites where there is a match between the viral DNA and the bacterium’s stored copy, essentially destroying the invading virus. Interestingly, when exposed to new viruses, bacteria can expand their library of CRISPR genes and build up their arsenal for recognizing future viruses. The CRISPR/Cas phenomenon is a three-step process for eliminating a mutation in a DNA strand. Fig. 15.1 shows simplistically how a mutation is eliminated from a DNA strand. Gene mutation is a natural activity that takes place at random. A mutation can affect the normal growth of the individual and exhibit itself as a disease or as an abnormality in muscle growth, as in muscular dystrophy. Any random change in a gene’s DNA is likely to result in a protein that does not function normally or may not function at all. Such mutations are likely to be harmful. Harmful mutations may cause genetic disorders or cancer. If a mutation occurs in a noncoding region, it will likely have little or no effect; if it occurs in a coding region, the consequences can be serious. It is possible that a mutation could be favorable, for example, sharper visual acuity that makes a predatory animal a more effective hunter.
In general, however, the vast majority (97%) of all mutations are either harmful or neutral. If a mutation occurs in a recessive gene, it may not be expressed, i.e., the effects of the mutation will not be seen. In these cases, the mutation is carried in the genes of the individual and may be passed on to successive generations, only to show up as a harmful mutation in a later generation when an individual with homozygous recessive genes is produced. Scientists have discovered that, instead of a viral gene, a portion of any other gene can also be placed into the CRISPR library in the bacterial DNA. Once a cell encounters that specific gene again, the CRISPR-Cas system lines up its own copy of the DNA with the foreign DNA and cuts the foreign DNA precisely, enabling the gene to be removed and another gene to be put in its place. From this, the altered foreign DNA can be transferred into different types of cells in a wide variety of organisms, ranging from insects to plants and even to human cells. CRISPR is popular because of its simplicity, low cost, and high efficiency. For this very reason, CRISPR is revolutionizing the field by “democratizing gene editing,” i.e., making gene editing possible in ordinary scientific laboratories.
FIGURE 15.1 CRISPR/Cas9 is the magical gene-editing tool, able to modify DNA genes with amazing precision and accuracy. CRISPR will be used to treat diseases that have defeated medicine, but now there is a bright light at the end of the tunnel.
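The bacterial defense loop described above (acquire a spacer copy of an invader, then scan and cut future matches) can be sketched as a toy string-matching simulation. Everything here is invented for illustration: the class name, the short sequences, and the exact-match rule are all simplifications, since real guide-RNA targeting involves PAM sites and tolerates some mismatches.

```python
# Toy model of the CRISPR-Cas immune mechanism described above.
# All sequences and names are hypothetical illustrations, not real genes.

class CrisprLibrary:
    def __init__(self):
        self.spacers = []          # stored copies of previously seen viral DNA

    def acquire(self, viral_fragment):
        """Adaptation: store a short copy of the invader's DNA."""
        if viral_fragment not in self.spacers:
            self.spacers.append(viral_fragment)

    def scan_and_cut(self, foreign_dna):
        """Interference: if any spacer matches, 'cut' the foreign DNA
        at the match site and return the two fragments."""
        for spacer in self.spacers:
            pos = foreign_dna.find(spacer)
            if pos != -1:
                cut_site = pos + len(spacer) // 2
                return foreign_dna[:cut_site], foreign_dna[cut_site:]
        return None                # no match: the invader is not recognized

library = CrisprLibrary()
library.acquire("GGTACC")          # remembered from a past infection

result = library.scan_and_cut("ATATGGTACCTTG")   # same virus attacks again
print(result)                      # the viral DNA is cut into two fragments
print(library.scan_and_cut("CCCCCCCC"))          # unknown virus is ignored
```

The same scan-and-cut idea, pointed at a chosen gene instead of a viral one, is what makes CRISPR usable as an editing tool.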
The magic of CRISPR, the futuristic phenomenon Scientists from multiple fields are using CRISPR to add or remove genes, to study the role of specific mutations, and to develop animal models to see how diseases are inherited. Moreover, CRISPR is becoming a highly attractive tool because it offers the potential to correct genetic disorders and may even be used as a gene therapy in humans to replace faulty genes. Yet, this great breakthrough is not without controversy. The scientific community was both intrigued and startled in April 2015, when Chinese scientists used CRISPR on human embryos collected from fertility clinics. The researchers used the gene-editing tool to alter the gene responsible for causing a blood disorder known as β-thalassemia. The scientists were able to successfully alter the harmful mutation in a subset of the embryos but found that complications, such as off-target effects (i.e., other genes being altered), also occurred. Although the embryos they used were not able to develop into live births, their work nonetheless raised ethical concerns regarding CRISPR technology. Similarly, in October 2015, scientists used CRISPR to successfully make a genetically altered line of mosquitoes resistant to carrying the malaria parasite. Their aim was to show that one possible
way of reducing the incidence of malaria would be to release such genetically modified mosquitoes into the environment. These mosquitoes would replicate and spread across a region, coming to dominate the mosquito population and helping to reduce the spread of malaria. Although their experiment was only meant to illustrate that such gene editing is easily possible, manipulating an entire insect population could have catastrophic and unforeseen consequences within the ecosystem. There is a certain breed of technooptimist who likes to talk about “the singularity,” a time when technology progresses so rapidly that life is transformed beyond recognition. The driving force of this hypothetical event is artificial intelligence (AI), but biotechnology plays a key role too. Singularity watchers are no doubt eyeing the ongoing gene-editing revolution with glee. Progress is dizzying, especially with a technique called CRISPR. As CRISPR use expands, such controversial experimenting will continue. To discuss the ethical concerns, an international panel of scientists held a summit in Washington, DC, in December 2015. They focused primarily on how gene editing should be used in humans and raised important questions. Would the gene-editing tool be used in humans only to return dysfunctional genes to a functioning, healthy state, or would it be used to alter characteristics based on fashion and preference? What regulations and safety standards should be employed before hospitals and clinics begin to offer gene-editing therapy to parents? The summit was an important step, but international laws and standards have yet to be enacted. Gene editing and similar technologies in biomedicine should only be used to protect humanity. That is, they should only be used to change a diseased state to a healthy state. A misuse of this technology, say, for nefarious reasons, would be against the principles of international medical ethics.
Gene editing using CRISPR beautifully harnesses natural processes within bacteria. It has immense potential to reduce suffering throughout the world and benefit mankind. Lasting treatments for incurable genetic diseases could finally be produced, thus improving the lives of millions of people. With this new wave of gene editing, it is the responsibility of the public and scientific community alike to ensure that laws and regulations on this matter conform with national and international laws and health organizations. It is no surprise that pivotal advances in science provoke religious metaphors. Crick and Watson’s discovery transformed our view of life itself: from a manifestation of spiritual magic to a chemical process. One more territorial gain in the metaphysical chess match between science and religion. Charles Darwin’s theory of evolution was certainly a vital move in that chess game, if not checkmate.
Stepping into God’s domain More than 1.5 billion people profess the religion called Islam. For vast numbers of these believers, their religion is their life. Rightly or wrongly, they are passionate about their religion! Although many Muslims are well-educated scientists and biomedical researchers, the powerful influence of Islamic religious belief, amplified by extremists who capitalize on its most militant interpretations, can be summed up in one admonition: “do not step into the domain of God.” DNA is unique, and every cell in the body has the same DNA stamp. DNA is like a blueprint of the soul, another sacred thing that is not like anyone else’s in the world. DNA traces your history
and your future, because it determines who you came from and who you will create. It is a miracle of life. God is in the microscopic details, in the meeting of egg and sperm in the particular way that could only have led to you. DNA is a kind of holy ground. DNA is a sacred blueprint that should be left untouched. Reading DNA’s gene blueprint is like stepping into God’s domain. Religious fundamentalists of all the world’s religions firmly believe that DNA should not be used as a commodity to describe the past and forecast the future.
What is more important: DNA or religion? (Extracted from the Telegraph, 3-3-2019). The scientists who launched a revolution with the discovery of the structure of DNA in Cambridge 50 years ago have both used the anniversary to mount an attack on religion. When Francis Crick and James Watson revealed DNA’s double-helix structure in 1953, they started the new field of biotechnology, revealed the molecular basis of the diversity of life on Earth and the process of inheritance, and shed light on diseases such as cancer, and even the origins of antisocial behavior. Watson, Crick, and Maurice Wilkins were awarded the Nobel Prize in Physiology or Medicine in 1962 for their research on the structure of nucleic acids. Watson and Crick were both outspoken atheists.
Some atheistic arrogance Speaking to the Telegraph (British news media), Crick, 86, said, “The god hypothesis is rather discredited.” Indeed, he says his distaste for religion was one of his prime motives in the work that led to the sensational 1953 discovery. Crick said: “I went into science because of these religious reasons, there’s no doubt about that. I asked myself what were the two things that appear inexplicable and are used to support religious beliefs: the difference between living and nonliving things, and the phenomenon of consciousness.”
Crick argues that since many of the actual claims made by specific religions over 2000 years have proved false, the burden of proof should be on the claims they make today, rather than on atheists to disprove the existence of God. He said: “Archbishop Ussher claimed the world was created in 4004 BC. Now we know it is 4.5 billion years old. It’s astonishing to me that people continue to accept religious claims. People like myself get along perfectly well with no religious views.” His co-discoverer, Watson, 74, told the Telegraph that religious explanations were “myths from the past ... Every time you understand something, religion becomes less likely.” He continued: “Only with the discovery of the double helix and the ensuing genetic revolution have we had grounds for thinking that the powers held traditionally to be the exclusive property of the gods might one day be ours.”
The Collins defense The Human Genome Project, which reads the genetic recipe of a human being, is currently led by a devout Christian, Francis Collins, who succeeded Watson in that post in 1993. He complained at a recent meeting of scientists in California that God was receiving a “cold reception” during the celebrations marking the 50th anniversary of the discovery of DNA’s structure. Collins was concerned that
the antireligious tone would increase antipathy to genetics among the 70%-80% of people who, polls show, hold religious belief. Another survey revealed that this belief is held by 40% of working scientists. “One should not assume that the perspective so strongly espoused by Watson and Crick represents the way that all scientists feel,” said Collins. Collins has, in the past, worked in a mission hospital in West Africa. Religion and science “are nicely complementary and mutually supporting,” he said. As one example, his research to find the faulty gene responsible for cystic fibrosis provided scientific exhilaration and “a sense of awe at uncovering something that God knew before that we humans didn’t.” The tragedy, he said, is that many people believe that, if evolution is true, which it clearly is, then God cannot be true. However, he blamed this on the reaction of the scientific establishment to the literal interpretation of Genesis by Creationists, views not held by respectable theologians. Collins outlined his own belief: “God decided to create a species with whom he could have fellowship. Who are we to say that evolution was a dumb way to do it? It was an incredibly elegant way to do it.”
Dr. Crick and Churchill arguments The antipathy to religion of the DNA pioneers is long standing. In 1961, Crick resigned as a fellow of Churchill College, Cambridge, when it proposed to build a chapel. When Sir Winston Churchill wrote to him pointing out that “none need enter the chapel unless they wish,” Crick replied that, on those grounds, the college should build a brothel, and enclosed a check for 10 guineas. Watson described how he gave up attending mass at the start of World War II. “I concluded that the Church was just a bunch of fascists that supported Franco. I stopped going on Sunday mornings and watched the birds with my father instead.” Watson’s interest in ornithology led to a glittering career in science and to the discovery of the double helix. Dr. Ting Wu (wife of Dr. George Church) is a professor of genetics at Harvard Medical School. Wu believes people should know what scientific advances mean for them. The challenge is empowering communities that are skeptical of science because they have been underserved or even mistreated in the past. Wu’s outreach to faith groups comes as advances in genetics force scientists to grapple with the power of their newly discovered technology. The issue driving much of the ethical debate these days is genome editing, which has become much simpler and more efficient with CRISPR. The problem is that the line between treatment and enhancement is in the eye of the beholder. Would editing a genome to protect people from HIV be considered a treatment? Should scientists eliminate Down syndrome or genetic causes of blindness? Those conditions are viewed by some as disabilities but by others as traits that should in their own ways be respected and embraced. Religious leaders and bioethicists have debated genome editing for decades, but it has largely been a theoretical consideration. CRISPR makes once-theoretical notions, say, editing the genomes of embryos, a very real possibility.
(Those changes are called “germline” edits and would be passed on to future generations.) It is a revolution that is being driven by scientists like Wu’s husband, famed geneticist and her Harvard Medical School colleague, Dr. George Church.
Religious communities are not happy As with scientists and secular bioethicists, religious communities have shown varying degrees of comfort with the notion of genome editing. Procedures aimed at curing disease are generally in line
with certain religious tenets, even if those procedures require sophisticated technology; the Vatican said in 2002 that “germ line genetic engineering with a therapeutic goal in man would in itself be acceptable” if it could be done safely and without leading to the loss of embryos. But genome editing could, at least in theory, be used to do much more: not just to treat conditions, but to “enhance” human beings, as bioethicists put it. The human genome naturally picks up mutations, and genes are turned “on” or “off” by a number of factors, so it is not as if the genome is fixed. But even some leading scientists have expressed the belief that there is something sacred about the human genome. They say it deserves reverence as scientists continue to find ways to manipulate it or, even more radically, consider ways of recreating it synthetically.
Personal and medical data used as DNA fingerprint More than 50 years have passed since Watson and Crick discovered the structure of DNA, and the double helix has replaced the caduceus as the symbol of scientific and medical progress. We have mapped the human genome and embarked on identifying and curing heretofore intractable genetic conditions. With startling swiftness, we have also constructed DNA databases and storage banks to manage the genetic information generated by these discoveries. The most zealous advocates for these new technologies imagine only the endless possibilities: We will solve and deter crime; we will rescue the falsely convicted from prison sentences or execution; we will uncover our genetic ancestry; we will map, understand, and cure dreaded diseases; we will tailor pharmaceuticals according to each individual’s genetic makeup; we will gain crucial understanding about the respective roles of nature and nurture in shaping human identity; and we will create the genetic economy of the future. The public discussion of DNA fingerprinting has focused primarily on its uses within the criminal justice system. In the United States, the first criminal conviction based on DNA evidence came in 1987. The battles in the late 1980s and early 1990s over the effectiveness and accuracy of DNA as forensic evidence (infamously featured in the televised murder trial of O.J. Simpson) proved in the end to be merely a splendid little war. Courts quickly embraced DNA evidence as legally admissible, and legislatures were soon responding to law enforcement’s claims that they needed DNA databases to manage this new and powerful form of forensic information. Within 10 years of that first conviction, all 50 states required convicted felons to submit DNA samples; soon every state had established its own criminal DNA database.
DNA is the human future diary and cannot be fooled By focusing so much on dramatic stories of finding the guilty and freeing the innocent, or on the prospect of using genetic information to cure disease, we eclipse both the full benefits and the inherent risks of DNA’s darker edge. While the creation of DNA databases can often be defended case by case, the development of this technology serves an end in itself apart from any particular application. It provides an inescapable means of identification, categorization, and profiling, and it does so with a type of information that is revelatory in a way few things are. As bioethicist George Annas put it, DNA is a person’s “future diary.” It provides genetic information unique to each person; it has the potential to reveal to third parties a person’s predisposition to
illnesses or behaviors without the person’s knowledge, and it is permanent information, deeply personal, with predictive powers. Taken together, the coming age of DNA technology will change the character of human life, both for better and for worse, in ways that we are only beginning to imagine, both because of what it will tell us for certain and what it will make us believe. To know one’s own future diary, or to know someone else’s, is to call into question the very meaning and possibility of human liberty.
Finally, CODIS was born In 1994, the DNA Identification Act created CODIS (the Combined DNA Index System), a national DNA database run by the FBI, which links all state crime databases (https://www.fbi.gov/services/laboratory/biometric-analysis/codis). Today, the newspapers regularly run stories of a murderer who has been identified through a “cold hit” on a DNA database or of an innocent man being freed from prison after DNA evidence exonerates him. In March 2003, Attorney General John Ashcroft announced a new initiative, “Advancing Justice Through DNA Technology,” that seeks $1 billion over the next 5 years to aid in “realizing the full potential of DNA technology to solve crime and protect innocent people.” Media coverage focused on the initiative’s efforts to eliminate the backlog of DNA samples at state and federal criminal laboratories, but the initiative seeks something else as well: the expansion of CODIS. The Bush administration was keen on giving the FBI access to the full range of samples in state DNA databases, including those of people placed under arrest but not convicted, rather than the smaller range of samples currently included.
Now religion speaks about DNA A Google search for the three words conflict, science, and religion returned about 95.5 million results in early 2018! Conflicts have been going full blast between science and religion, which have two very different concepts of truth. Religious truth is typically derived from people’s close relationship with God and from following the Hebrew and Christian scriptures (the Old and New Testaments), held to be without error. The same applies to Muslims, who follow the Holy Quran and the teachings (Hadith) of the Prophet Mohammed. On the other hand, scientific truth starts with the imagination to invent or discover something in nature. Scientists formulate a concept that might explain their observations. They then test that concept against reality. Later, concepts mature and become theories. For example, the second law of thermodynamics and general relativity are well-established laws that have been applied in technology. The science of biomedicine called genetics tells us that mutations in DNA can be replaced with healthy genes to heal patients. Using CRISPR is stepping into the domain of God. However, the success of editing DNA to heal patients has earned the consent of some clergy, who guardedly praised DNA fixing. The science of biomedicine is going to progress deeper into the human anatomy, and new discoveries will emerge, while the war with religion marches forward.
The Reverend Colleen Squires, minister at All Souls Community Church of West Michigan, a Unitarian Universalist congregation, responds to the question of DNA and citizens’ private data: The seventh Principle of Unitarian Universalism is respect for the interdependent web of all existence, of which we are a part. This question regarding commercial DNA databases is an example of how we are all ultimately connected. The question also highlights once again that science is moving faster than our ethical and moral understanding of the ramifications of applying this science in various uses. If the science is used to convict or to clear an individual of a crime, I think this is an ethical use of the database. When it identifies somebody who had once asked for their samples to remain anonymous, as with medical research subjects or sperm or egg donors, I think it is a breach of the original agreement. I think we have entered an age of discovery in science that will end the idea of an anonymous donation; we are indeed all connected.
The story of evolution In 1859, Charles Darwin published the book On the Origin of Species by Means of Natural Selection. He discussed his theory of evolution and identified its driving force as occasional, minute, accidental, random, and beneficial changes in an animal’s DNA, which give it a greater likelihood of generating more offspring. Eventually, through many such positive changes over a long time interval, a new animal species might evolve. Darwin’s book and theory triggered a major genetics-based conflict between science and religion, which is still being fought today. He is often quoted as saying, “It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is most adaptable to change.” Meanwhile, essentially all biologists, zoologists, and the like fully accept, in principle, the theory of evolution. Again, genetics is at the core of a major conflict between science and religion.
Concluding thoughts No, we did not come from monkeys. Do you think that 100,000 years from now people will believe we are monkeys? No, I do not think so. The people then may be much smarter and healthier than we are today. Technology will by then be totally immersed in the bodies of the citizens of the world, and they will enjoy a much higher level of living and prosperity. The process of evolution moves us forward with natural changes that make us better physically and mentally and adapt us to the environment. The sick will die, and the healthy will survive. We, Homo sapiens, originated through the process of biological evolution. We regularly interbreed and produce fertile offspring who are also capable of reproducing. So evolution occurs when there is change in the genetic material, the chemical molecule DNA, which our kids inherit from us. Genes are the segments of DNA that provide the chemical code for producing proteins. The information contained in DNA can change by a process known as mutation. Genes affect how the body and behavior of an organism develop during its life, and therefore genetically inherited characteristics can influence the likelihood of an organism’s survival and reproduction.
Paleoanthropology is the scientific study of human evolution. It is a subfield of anthropology, the study of human culture, society, and biology. The field involves an understanding of the similarities and differences between humans’ and other species’ genes, body form, physiology, and behavior. Paleoanthropologists search for the roots of human physical traits and behavior. They seek to discover how evolution has shaped the potentials, tendencies, and limitations of all people. For many people, paleoanthropology is an exciting scientific field because it investigates the origin, over millions of years, of the universal and defining traits of our species. However, some people find the concept of human evolution troubling because it does not seem to fit with religious and other traditional beliefs about how people, other living things, and the world came to be. Nevertheless, many people have come to reconcile their beliefs with the scientific evidence. But evolution does not change any single individual. Instead, it changes the inherited means of growth and development that typify a population (a group of individuals of the same species living in a particular habitat). Parents pass adaptive genetic changes to their offspring, and ultimately these changes become common throughout a population. As a result, the offspring inherit those genetic characteristics that enhance their chances of survival and ability to give birth, which may work well until the environment changes. Over time, genetic change can alter a species’ overall way of life, such as what it eats, how it grows, and where it can live. Human evolution took place as new genetic variations in early ancestor populations favored new abilities to adapt to environmental change and so altered the human way of life. The ethical debates between religion and DNA used to be consigned to speculative fiction. 
But as genetic editing moves from science fiction into real-world laboratories, the need to define the moral limits of science grows ever more urgent. While we have made significant progress in understanding human genetics, gene editing is currently not a totally precise science. Genomes are complex systems: every gene within a genome is interrelated, and we do not really know how changing one gene may affect others.
The last words of Einstein Einstein’s last words are not recorded; they were spoken in German, and the nurse in the room did not understand the language. So the last words should go to the others who knew him well. Elsa Einstein summed up his genius in a throwaway remark she made at the Mount Wilson observatory in California. An astronomer showed her a new telescope and explained that it was used for finding out the shape of the universe. Elsa responded, “Oh, my husband does that on the back of an old envelope!”
Appendices Appendix 15.A The God Gene: how faith is hardwired into our genes (extracted from Wikipedia) The God Gene hypothesis proposes that human spirituality is influenced by heredity and that a specific gene, called Vesicular MonoAmine Transporter 2 (VMAT2), predisposes humans toward spiritual or mystic experiences. The idea was proposed by geneticist Dean Hamer in the 2004 book The God Gene: How Faith Is Hardwired into Our Genes.
The God Gene hypothesis is based on a combination of behavioral genetic, neurobiological, and psychological studies. The major arguments of the hypothesis are as follows:
1. Spirituality can be quantified by psychometric measurements.
2. The underlying tendency to spirituality is partially heritable.
3. Part of this heritability can be attributed to the gene VMAT2. This gene acts by altering monoamine levels.
4. Spirituality provides an evolutionary advantage by providing individuals with an innate sense of optimism.
Scientific response: In the brain, VMAT2 proteins are located on synaptic vesicles. VMAT2 transports monoamine neurotransmitters from the cytosol of monoamine neurons into vesicles. PZ Myers argues, “It’s a pump. A teeny-tiny pump responsible for packaging a neurotransmitter for export during brain activity. Yes, it’s important, and it may even be active and necessary during higher order processing, like religious thought. But one thing it isn’t is a ‘god gene.’”
Carl Zimmer claimed that VMAT2 can be characterized as a gene that accounts for less than 1% of the variance in self-transcendence scores. These scores, Zimmer says, can signify anything from belonging to the Green Party to believing in ESP. Zimmer also points out that the God Gene theory is based on only one unpublished, unreplicated study. However, Hamer notes that the importance of the VMAT2 finding is not that it explains all spiritual or religious feelings, but rather that it points the way toward one neurobiological pathway that may be important.
Religious response: John Polkinghorne, an Anglican priest, member of the Royal Society, and Canon Theologian at Liverpool Cathedral, was asked for a comment on Hamer’s theory by the British national daily newspaper The Daily Telegraph. He replied: “The idea of a God gene goes against all my personal theological convictions. You can’t cut faith down to the lowest common denominator of genetic survival. It shows the poverty of reductionist thinking.” Walter Houston, the chaplain of Mansfield College, Oxford, and a fellow in theology, told the Telegraph: “Religious belief is not just related to a person’s constitution; it’s related to society, tradition, character; everything’s involved. Having a gene that could do all that seems pretty unlikely to me.” Hamer responded that the existence of such a gene would not be incompatible with the existence of a personal God: “Religious believers can point to the existence of God genes as one more sign of the creator’s ingenuity, a clever way to help humans acknowledge and embrace a divine presence.” He repeatedly notes in his book that “This book is about whether God genes exist, not about whether there is a God.”
Appendix 15.B Some laws and principles in evolutionary biology (from library of MERIT CyberSecurity)
Allen’s rule: Within species of warm-blooded animals (birds and mammals), populations living in colder environments tend to have shorter appendages than populations in warmer areas.
Bateman’s principle: Males gain fitness by increasing their mating success, whereas females maximize their fitness by investing in longevity, because their reproductive effort is much higher.
Bergmann’s rule: Northern races of mammals and birds tend to be larger than southern races of the same species.
Beijerinck’s principle (of microbial ecology): Everything is everywhere; the environment selects.
Bulmer effect: Genetic variance is reduced by selection, in proportion to the reduction of phenotypic variance of the parents relative to their entire generation.
Coefficient of relatedness: r = n(0.5)^L, where n is the number of alternative routes between the related individuals along which a particular allele can be inherited, and L is the number of meioses (generation links) per route.
Cope’s law of the unspecialized: The evolutionary novelties associated with new major taxa are more likely to originate from a generalized member of an ancestral taxon than from a specialized member.
Cope’s rule: Animals tend to get larger during the course of their phyletic evolution.
Latitudinal diversity gradient: There is a gradient of increasing species diversity from high latitudes to the tropics (see New Scientist, April 4, 1998, p. 32).
Fisher’s fundamental theorem: The rate of increase in fitness is equal to the additive genetic variance in fitness. This means that if there is a lot of variation in the population, the effect of selection will be large.
Fisher’s theorem of the sex ratio: In a population where individuals mate at random, the rarity of either sex will automatically set up selection pressure favoring production of the rarer sex. Once the rare sex is favored, the sex ratio gradually moves back toward equality.
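The coefficient-of-relatedness formula above is easy to check numerically. The following Python sketch is illustrative only; the function name and the example pedigrees are my own, not from the text:

```python
def relatedness(n_routes, n_links):
    """Coefficient of relatedness: r = n * 0.5**L, where n is the number
    of alternative inheritance routes between two individuals and L is
    the number of meioses (generation links) along each route."""
    return n_routes * 0.5 ** n_links

# Full siblings: two routes (via mother and via father), two meioses each.
print(relatedness(2, 2))   # 0.5
# First cousins: two routes, four meioses each.
print(relatedness(2, 4))   # 0.125
```

The familiar values r = 0.5 for siblings and r = 0.125 for first cousins fall out directly.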
Galton’s regression law: Individuals differing from the average character of the population produce offspring which, on average, differ to a lesser degree but in the same direction from the average as their parents.
Gause’s rule (competitive exclusion principle): Two species cannot live the same way in the same place at the same time (ecologically identical species cannot coexist in the same habitat). Coexistence is only possible through the evolution of niche differentiation (differences in beak size, root depth, etc.).
Haeckel’s biogenetic law: Proposed by Ernst Haeckel in 1874 as an attempt to explain the relationship between ontogeny and phylogeny. It claimed that ontogeny recapitulates phylogeny, i.e., an embryo repeats in its development the evolutionary history of its species as it passes through stages in which it resembles its remote ancestors. Embryos, however, do not pass through the adult stages of their ancestors; ontogeny does not recapitulate phylogeny. Rather, ontogeny repeats some ontogeny: some embryonic features of ancestors are present in the embryos of their descendants.
Haldane’s hypothesis (on recombination and sex): Selection to lower recombination on the Y-chromosome causes a pleiotropic reduction in recombination rates on other chromosomes (hence the recombination rate is lower in the heterogametic sex, such as males in humans and females in butterflies).
Hamilton’s altruism theory: If selection favored the evolution of altruistic acts between parents and offspring, then similar behavior might occur between other close relatives possessing the same altruistic genes, which were identical by descent. In other words, individuals may behave altruistically not only toward their own immediate offspring but also toward others such as siblings, grandchildren, and cousins (as happens in bee societies).
Hamilton’s rule (theory of kin selection): In an altruistic act, if the donor sustains cost C and the receiver gains benefit B as a result of the altruism, then an allele that promotes an altruistic act in the donor will spread in the population if B/C > 1/r, or rB - C > 0 (where r is the relatedness coefficient).
Hardy-Weinberg law: In an infinitely large population, gene and genotype frequencies remain stable as long as there is no selection, mutation, or migration. In a panmictic population of infinite size, the genotype frequencies will remain constant. For a biallelic locus where the allele frequencies are p and q: p² + 2pq + q² = 1 (see Basic Population Genetics for more).
Heritability: The proportion of the total phenotypic variance that is attributable to genetic causes: h² = genetic variance/total phenotypic variance. Natural selection tends to reduce heritability because strong (directional or stabilizing) selection leads to reduced variation.
Lyon hypothesis: The proposition by Mary F. Lyon that random inactivation of one X chromosome in the somatic cells of mammalian females is responsible for dosage compensation and mosaicism.
Muller’s ratchet: The continual decrease in fitness due to the accumulation of (usually deleterious) mutations, without compensating mutations and recombination, in an asexual lineage (H.J. Muller, 1964). Recombination (sexual reproduction) is much more common than mutation, so it can take care of mutations as they arise. This is one of the reasons why sex is believed to have evolved.
Parental investment theory (Robert Trivers): The sex making the larger investment in lactation, nurturing, and protecting offspring will be more discriminating in mating, and the sex that invests less in offspring will compete for access to the higher-investing sex (Trivers RL. Parental investment and sexual selection. In Campbell BG (Ed) Sexual Selection and the Descent of Man, 1871-1971. Chicago: Aldine, 1972, pp. 136-179; ISBN 0-435-62157-2). See also Trivers-Willard hypothesis.
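The Hardy-Weinberg identity and Hamilton's rule lend themselves to a short numerical check. This Python sketch is a minimal illustration; the function names are my own:

```python
def hw_genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for a biallelic locus:
    p^2 + 2pq + q^2 = 1, with q = 1 - p."""
    q = 1.0 - p
    return p**2, 2 * p * q, q**2

def altruism_favored(benefit, cost, r):
    """Hamilton's rule: an altruistic allele spreads when r*B - C > 0."""
    return r * benefit - cost > 0

aa, ab, bb = hw_genotype_freqs(0.7)
print(round(aa + ab + bb, 10))                     # 1.0
print(altruism_favored(benefit=4, cost=1, r=0.5))  # True (helping a full sibling)
```

For any allele frequency p the three genotype frequencies sum to 1, and with r = 0.5 an act costing 1 unit of fitness is favored whenever the sibling gains more than 2.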
Protein clock hypothesis: The idea that amino acid replacements occur at a constant rate in a given protein family (ribosomal proteins, cytochromes, etc.), so that the degree of divergence between two species can be used to estimate the time elapsed since their divergence.
Red Queen theory: An organism’s biotic environment consistently evolves to the detriment of the organism. Sex and recombination result in progeny genetically different from the previous generations and thus less susceptible to the antagonistic advances made during the previous generations, particularly by their parasites.
Selection coefficient (s): s = 1 - W, where W is relative fitness. This coefficient represents the relative penalty incurred by selection on genotypes that are less fit than others. For the genotype most strongly favored by selection, s = 0.
Selection differential (S) and response to selection (R): Following a change in the environment, in the parental (first) generation, the mean value of the character among those individuals that survive to reproduce differs from the mean value for the whole population by an amount S. In the second, offspring generation, the mean value of the character differs from that in the parental population by an amount R, which is smaller than S. Thus, strong selection of this kind (directional) leads to reduced variability in the population.
Tangled bank theory: An alternative to the Red Queen theory of sex and recombination. It states that sex and recombination function to diversify the progeny from each other so as to reduce competition among them (see the review by Burt & Bell (1987) on the Tangled Bank theory).
Trivers-Willard hypothesis: In species with a long period of parental investment after the birth of young, one might expect biases in parental behavior toward offspring of different sexes according to the parental condition; parents in better condition would be expected to show a bias toward male offspring (Trivers & Willard, 1973).
von Baer’s rule: The general features of a large group of animals appear earlier in the embryo than the special features.
Weismann’s hypothesis: The evolutionary function of sex is to provide variation for natural selection to act on (see Burt, 2000 for a review of Weismann’s hypothesis).
Wright-Fisher model: The most widely used population genetics model of reproduction. It assumes a finite and constant population size (N), nonoverlapping generations, and random mating. One result is that if a new allele appears in the population, its fixation probability equals its initial frequency (1/2N). See a lecture note on Wright-Fisher reproduction.
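The 1/(2N) fixation probability of a new neutral allele can be verified by simulating Wright-Fisher reproduction directly. The sketch below is my own minimal implementation under the stated assumptions (diploid population, constant size, nonoverlapping generations); it is not from the text:

```python
import random

def wright_fisher_fixation(N, trials=20000, seed=1):
    """Simulate Wright-Fisher reproduction for a single new allele in a
    diploid population of constant size N (i.e., 2N gene copies), and
    return the observed fixation probability, which for a neutral allele
    should approach 1/(2N)."""
    random.seed(seed)
    fixed = 0
    for _ in range(trials):
        count = 1                          # one new copy among 2N
        while 0 < count < 2 * N:
            # Each of the 2N offspring copies is drawn independently
            # from the parental allele frequency (binomial sampling).
            p = count / (2 * N)
            count = sum(random.random() < p for _ in range(2 * N))
        fixed += count == 2 * N
    return fixed / trials

print(wright_fisher_fixation(N=10))   # close to 1/(2*10) = 0.05
```

Most runs end in loss of the allele; over many trials the fraction of fixations converges to its initial frequency, as the model predicts.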
Appendix 15.C Holy Quran refers to DNA
The amazing science of biology is playing a key role in uncovering the secrets of DNA, which the holy Quran had referred to more than 1000 years ago. Genetics may soon lead to the elimination of crime, healthier and longer lives, and new animal species that have certain desired characteristics genetically incorporated.
The programming in genes
If we compare the human body to a building, the body’s complete plan and project down to its minute technical detail is present in DNA, which is located in the nucleus of each cell. All the developmental phases of a human being in the mother’s womb and after birth take place within the
outlines of a predetermined program. This perfect order in the development of humankind is stated as follows in the Quran:
The word “qaddara,” translated as “destined” in the above verse, comes from the Arabic verb “qadara.” It also translates as “arranging, setting out, planning, programming, seeing the future, the writing of everything in destiny (by Allah).” When the father’s sperm cell fertilizes the mother’s egg, the parents’ genes combine to determine all of the offspring’s physical characteristics. Each one of these thousands of genes has a specific function. It is the genes that determine the color of the eyes and hair, height, facial features, skeletal shape, and the countless details of the internal organs, brain, nerves, and muscles that we will bear when we come to our 30th year. These genes are encoded in the nucleus of our first cell 30 years and 9 months beforehand, starting from the moment of insemination. In addition to all the physical characteristics, the thousands of different processes that take place in the cells and body (and indeed control of the whole system) are recorded in the genes. For example, whether a person’s blood pressure is generally high, low, or normal depends on the information in his or her genes. The first cell that forms when the sperm and the egg are joined also carries the first copy of the DNA molecule, which will hold the code in every cell of the person’s body, right up until death. DNA is a molecule of considerable size. It is carefully protected within the nucleus of the cell, and it serves as an information bank of the human body, as it contains the genes we mentioned earlier. The first cell, the fertilized egg, then divides and multiplies in the light of the program recorded in the DNA. The tissues and organs begin to form: this is the beginning of a human being. The coordination of this complex structuring is brought about by the DNA molecule, which consists of atoms such as carbon, phosphorus, nitrogen, hydrogen, and oxygen.
Holy Quran reference to DNA
The term “DNA” is an abbreviation for deoxyribonucleic acid, the genetic material of living things. The beginning of the science of genetics dates back to the genetic laws drawn up by the Austrian monk Johann Gregor Mendel in 1865. That date, a turning point in the history of science, is referred to in verse 65 of Surat al-Kahf, or verse 18:65, of the Quran. When the appearance of the letters D-N-A (Dal-Nun-Alif in Arabic) side by side in places in the Quran is examined, they appear most frequently in verse 65 of Surat al-Kahf. The letters D-N-A appear side by side three times in this verse, in a most incomparable manner. In no other verse of the Quran do these letters appear consecutively so often. The number of this exceptional verse in which the term
DNA appears so strikingly is 18:65. These numbers are an expression of the date when the science of genetics began. This cannot be regarded as a coincidence. This is something miraculous, because the scientific world only came up with the name DNA (deoxyribonucleic acid) comparatively recently. In Surat al-Kahf, which refers to DNA and the year 1865 when the science of genetics began, DNA is repeated seven times, as is RNA (the Arabic letters Ra-Nun-Alif). Like DNA, the RNA molecule is a molecule giving rise to genetic structure. DNA and RNA are parallel strands, and they are mentioned an equal number of times in the Quran, revealed many hundreds of years earlier. (Allah knows the truth.)
Appendix 15.D The information capacity of DNA
The information capacity recorded in DNA is of a size that astonishes scientists. There is enough information in a single human DNA molecule to fill a million encyclopedia pages, or 1000 volumes. To put it another way, the nucleus of a cell contains information equivalent to that in a 1-million-page encyclopedia, and it serves to control all the functions of the human body. For comparison, the 23-volume Encyclopedia Britannica, one of the largest encyclopedias in the world, contains a total of 25,000 pages. Yet a single molecule in the nucleus of a cell, so much smaller than that cell, contains a store of information 40 times larger than the world’s largest encyclopedia. What we have here is a 1000-volume encyclopedia, the like of which exists nowhere else on Earth. As stated in Harun Yahya’s book, Allah’s Miracles in the Qur’an: this is a miracle of design and creation within our very own bodies, for which evolutionists and materialists have no answer.
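The capacity claim can be explored with back-of-the-envelope arithmetic. The figures in the sketch below (2 bits per base, roughly 3.2 billion base pairs in a haploid human genome, ~3000 characters per printed page) are common rough assumptions of mine, not values taken from the text, and the result varies strongly with the assumptions chosen:

```python
# Back-of-the-envelope estimate of the raw information in a human genome.
BITS_PER_BASE = 2          # four bases (A, C, G, T) -> 2 bits each
GENOME_BASES = 3.2e9       # approximate haploid human genome length
CHARS_PER_PAGE = 3000      # rough characters on a printed encyclopedia page

genome_bits = BITS_PER_BASE * GENOME_BASES
genome_bytes = genome_bits / 8
pages = genome_bytes / CHARS_PER_PAGE   # ~1 byte per character

print(f"{genome_bytes / 1e6:.0f} MB")        # 800 MB
print(f"{pages / 1e6:.2f} million pages")    # 0.27 million pages
```

Even under these conservative assumptions, a single molecule encodes hundreds of megabytes, which is the point the encyclopedia comparison is making.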
Advantages of knowing DNA
With the growing understanding of genomics, scientists are now trying to understand the process of aging and how it could be slowed down. Average life expectancies have increased significantly during the past 60 years due to better hygiene and medications, although they still vary considerably from country to country. For instance, the average life expectancy in Japan is about 82 years, whereas in many African countries it is below 50 years. The understanding of genetics is also leading to the development of new weapons to fight disease. For example, some 200 million people suffer from malaria each year, and about 800,000 deaths are recorded annually due to it. The vast majority of these deaths (over 90%) occur in sub-Saharan Africa, and children are the most vulnerable. The disease is caused by a parasite transmitted through the bite of an infected female mosquito. Dengue and yellow fever are also spread by mosquito bites. Antimalarial drugs such as chloroquine are losing their efficacy because of the development of resistance. An exciting new approach to tackling malaria is to develop genetically modified mosquitoes, which have the potential to bring down the population of the harmful female variety.
Appendix 15.E Glossary for research ethics (from library of MERIT CyberSecurity)
Accountability: Taking personal responsibility for one’s conduct. Accreditation: A process in which an accrediting body determines whether an institution or organization meets certain standards developed by the body. For example, the Association for the Assessment and Accreditation of Laboratory Animal Care (AAALAC) accredits animal research
programs, and the Association for the Accreditation of Human Research Protection Programs (AAHRPP) accredits human subjects research programs. Adverse event (AE): A medically undesirable event occurring in a research subject, such as an abnormal sign, symptom, worsening of a disease, injury, etc. A serious adverse event (SAE) results in death, hospitalization (or increased hospital stay), persistent disability, birth defect, or any other outcome that seriously jeopardizes the subject’s health. AEs that are also unanticipated problems should be reported promptly to institutional review boards and other appropriate officials. Amendment: A change to a human subjects research protocol approved by an institutional review board or the board’s chair (if the change is minor). Animal care committee: See Institutional animal care and use committee. Animal rights: The view that (nonhuman) animals have moral or legal rights. Proponents of animal rights tend to regard animal experimentation as unethical because animals cannot consent to research. Animal welfare: 1. The health and well-being of animals. 2. The ethical obligation to protect and promote animal welfare in research. Factors affecting animal welfare include food, water, housing, climate, mental stimulation, and freedom from pain, suffering, disease, and disability. See also Three Rs. Asilomar Conference: A 1975 meeting, held in Asilomar, CA, of scientists involved in developing recombinant DNA techniques, convened to consider the oversight and responsible use of this technology. The scientists recommended the development of safety protocols as a means of protecting laboratory workers and the public from harm. Assent: A subject’s affirmative agreement to participate in research. Assent may take place when the subject does not have the capacity to provide informed consent (e.g., the subject is a child or mentally disabled) but has the capacity to meaningfully assent. See Informed consent.
Audit: A formal review of research records, policies, activities, personnel, or facilities to ensure compliance with ethical or legal standards or institutional policies. Audits may be conducted regularly, at random, or for-cause (i.e., in response to a problem). Author: A person who makes a significant contribution to a creative work. Many journal guidelines define an author as someone who makes a significant contribution to (1) research conception and design, (2) data acquisition, or (3) data analysis or interpretation and who drafts or critically reads the paper and approves the final manuscript. Authorship, ghost: Failing to list someone as an author on a work even though they have made a significant contribution to it. Authorship, honorary: Receiving authorship credit when one has not made a significant contribution to the work. Autonomy: (1) The capacity for self-governance, i.e., the ability to make reasonable decisions. (2) A moral principle barring interference with autonomous decision-making. See Decision-making capacity. Bad apples theory: The idea that most research misconduct is committed by individuals who are morally corrupt or psychologically ill. This idea can be contrasted with the view that social, financial, institutional, and cultural factors play a major role in causing research misconduct. See Culture of integrity. Belmont Report: A report issued by the US National Commission for the Protection of Human Subjects in Biomedical and Behavioral Research in 1979, which has had a significant influence over
human subjects research ethics, regulation, and policy. The report provided a conceptual foundation for the Common Rule and articulated three principles of ethics: respect for persons, beneficence, and justice. Beneficence: The ethical obligation to do good and avoid causing harm. See also Belmont Report. Benefit: A desirable outcome or state of affairs, such as medical treatment, clinically useful information, or self-esteem. In the oversight of human subjects research, money is usually not treated as a benefit. Bias: The tendency for research results to reflect the scientist’s (or sponsor’s) subjective opinions, unproven assumptions, political views, or personal or financial interests, rather than the truth or facts. See also Conflict of interest. Biobank: A repository for storing biological samples or data to be used in research. Biobanks usually require investigators or institutions to agree to certain conditions as a condition for sharing samples or data with them. Bioethics: The study of ethical, social, or legal issues arising in biomedicine and biomedical research. Censorship: Taking steps to prevent or deter the public communication of information or ideas. In science, censorship may involve prohibiting the publication of research or allowing publication only in redacted form (with some information removed). Citation amnesia: Failing to cite important work in the field in a paper, book, or presentation. Classified research: Research that the government keeps secret to protect national security. Access to classified research is granted to individuals with the appropriate security clearance on a need-toknow basis. Clinical investigator: A researcher involved in conducting a clinical trial. Clinical trial: An experiment designed to test the safety or efficacy of a type of therapy (such as a drug). Clinical trial, active-controlled: A clinical trial in which the control group receives a treatment known to be effective. 
The goal of the trial is to compare different treatments. Clinical trial, placebo-controlled: A clinical trial in which the control group receives a placebo. The goal of the trial is to compare a treatment with a placebo. Clinical trial, phases: Sequential stages of clinical testing, required by regulatory agencies, used in the development of medical treatments. Preclinical testing involves experiments on animals or cells to estimate safety and potential efficacy. Phase I trials are small studies (50-100 subjects) conducted in human beings for the first time to assess safety, pharmacology, or dosing. Phase I studies are usually conducted on healthy volunteers, although some are conducted on patients with terminal diseases, such as cancer patients. Phase II trials are larger studies (500 or more subjects) conducted on patients with a disease to assess safety and efficacy and establish a therapeutic dose. Phase III trials are large studies (up to several thousand subjects) conducted on patients to obtain more information on safety and efficacy. Phase IV (or postmarketing) studies are conducted after a treatment has been approved for marketing to gather more information on safety and efficacy and to expand the range of the population being treated. Clinical trial, registration: Providing information about a clinical trial in a public registry. Most journals and funding agencies require that clinical trials be registered. Registration information includes the name of the trial, the sponsor, study design and methods, population, inclusion/exclusion criteria, and outcome measures.
Clinical utility: The clinical usefulness of information, e.g., for making decisions concerning diagnosis, prevention, or treatment. Coercion: Using force, threats, or intimidation to make a person comply with a demand. Collaboration agreement: An agreement between two or more collaborating research groups concerning the conduct of research. The agreement may address the roles and responsibilities of the scientists, access to data, authorship, and intellectual property. Commercialization: The process of developing and marketing commercial products (e.g., drugs, medical devices, or other technologies) from research. See also Copyrights, intellectual property, patents. Common law: A body of law based on judicial decisions and rulings. Common rule: The US Department of Health and Human Services regulations (45 CFR 46) for protecting human subjects, which has been adopted by 17 federal agencies. The Common rule includes subparts with additional protections for children, neonates, pregnant women and fetuses, and prisoners. Community review: A process for involving a community in the review of research conducted on members of the community. Some research studies include community advisory boards as a way of involving the community. Competence: The legal right to make decisions for one’s self. Adults are considered to be legally competent until they are adjudicated incompetent by a court. See Decision-making capacity. Compliance: In research, complying with laws, institutional policies, and ethical guidelines related to research. Conduct: Action or behavior. For example, conducting research involves performing actions related to research, such as designing experiments, collecting data, analyzing data, and so on. Confidentiality: The obligation to keep some types of information confidential or secret. 
In science, confidential information typically includes private data pertaining to human subjects, papers or research proposals submitted for peer review, personnel records, proceedings from misconduct inquiries or investigations, and proprietary data. See also Privacy. Conflict of interest (COI): A situation in which a person has a financial, personal, political, or other interest, which is likely to bias his or her judgment or decision-making concerning the performance of his or her ethical or legal obligations or duties. Conflict of interest, apparent or perceived: A situation in which a person has a financial, personal, political, or other interest that is not likely to bias his or her judgment or decision-making concerning the performance of his or her ethical or legal obligations or duties but that may appear to an outside observer to bias his or her judgment or decision-making. Conflict of interest, institutional: A situation in which an institution (such as a university) has financial, political, or other interests, which are likely to bias institutional decision-making concerning the performance of institutional ethical or legal duties. Conflict of interest, management: Strategies for minimizing the adverse impacts of a conflict of interest, such as disclosure, oversight, or recusal/prohibition. Consent: See Informed consent. Consequentialism: An approach to ethics, such as utilitarianism, which emphasizes maximizing good over bad consequences resulting from actions or policies. Continuing review: In human subjects research, subsequent review of a study after it has been approved by an IRB. Continuing review usually happens on an annual basis.
Copyright: A right, granted by a government, which prohibits unauthorized copying, performance, or alteration of creative works. Copyright laws include a fair use exemption, which allows limited, unauthorized uses for noncommercial purposes. Correction (or errata): Fixing a minor problem with a published paper. A minor problem is one that does not impact the reliability or integrity of the data or results. Journals publish correction notices and identify corrected papers in electronic databases to alert the scientific community to problems with the paper. See also Retraction. Culture of integrity: The idea that the institutional culture plays a key role in preventing research misconduct and promoting research integrity. Strategies to promote a culture of integrity include education and mentoring in the responsible conduct of research; research policy development; institutional support for research ethics oversight, consultation, and curriculum development; and ethical leadership. Data: Recorded information used to test scientific hypotheses or theories. Data may include laboratory notebooks (paper or digital), field notes, transcribed interviews, spreadsheets, digital images, X-ray photographs, audio or video recordings, and outputs from machines (such as gas chromatographs or DNA sequencers). Original (or primary) data are drawn directly from the data source; secondary (or derived) data are based on the primary data. Data auditing: See Audit. Data imputation: Use of statistical methods to fill in or replace missing or lost data. Imputation is not considered to be fabrication if it is done honestly and appropriately. Data management: Practices and policies related to recording, storing, auditing, archiving, analyzing, interpreting, sharing, and publishing data. Data outlier: A data point that is more than two standard deviations from the mean. Removal of outliers without articulating a legitimate reason may constitute data falsification.
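The glossary's two-standard-deviation definition of a data outlier is easy to operationalize. The following Python sketch is a minimal illustration of that definition (the function name and sample readings are my own, not from the text); note that in practice outliers should be investigated and documented, not silently removed:

```python
import statistics

def flag_outliers(data, k=2.0):
    """Flag points more than k standard deviations from the sample mean,
    matching the glossary's two-standard-deviation definition."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)       # sample standard deviation
    return [x for x in data if abs(x - mean) > k * sd]

readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.9, 10.0, 10.1, 20.0]
print(flag_outliers(readings))        # [20.0]
```

The threshold is a convention, not a law: a single extreme value inflates both the mean and the standard deviation, so with small samples the rule can mask genuine outliers.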
Data and safety monitoring board (DSMB): A committee that monitors data from human subjects research to protect participants from harm and promote their welfare. DSMBs may recommend to an institutional review board that a study be stopped or altered. Data use agreement (DUA): An agreement between institutions for the sharing and use of research data. Deception: In human subjects research, using methods to deceive subjects about the goals and nature of a study or the methods, tests, interventions, or procedures used in the study. See also Placebo, Observer effect. Decision-making capacity (DMC): The ability to make sound decisions. DMC is often situational and comes in degrees: for example, a person may be able to order food from a menu but not be able to make a decision concerning complex medical treatment. Factors that can compromise DMC include mental illness or disability, extreme emotional stress, drugs, age, or serious physical illness. DMC is not the same as legal competence: a demented adult may be legally competent but lack DMC. Deidentified data or samples: Data or biological samples that have been stripped of information, such as name or medical record number, which personally identifies individuals. Deontology: An approach to ethics, such as Kantianism, which emphasizes adherence to rules or principles of conduct. Discrimination: Treating people differently based on irrelevant characteristics, such as skin color, ethnicity, or gender. Double-blinding: Processes used to prevent human research subjects and researchers from discovering who is receiving an experimental treatment versus a placebo. Double-blinding is used to control for the placebo effect.
Dual use research: Research that can be readily used for beneficial or harmful purposes. Duplicate publication: Republishing the same paper or data without proper acknowledgment. Emergency research: In human subjects research, research that is conducted when a subject who cannot provide informed consent faces a life-threatening illness that requires immediate treatment and has no available legally authorized representative to provide consent. The Food and Drug Administration has developed special rules for emergency research involving products that it regulates. Error: An unintended adverse outcome; a mistake. Ethical dilemma: A situation in which two or more potential actions appear to be equally justifiable from an ethical point of view, i.e., one must choose between the lesser of two evils or the greater of two goods. Ethical reasoning: Making a decision in response to a moral dilemma based on a careful and thorough assessment of the different options in light of the facts and circumstances and ethical considerations. Ethical relativism: The view that ethical standards are relative to a particular culture, society, historical period, etc. When in Rome, do as the Romans do. See Ethical universalism. Ethical theory: A set of statements that attempts to unify, systematize, and explain our moral experience, i.e., our intuitions or judgments about right/wrong, good/bad, etc. See also Kantianism, Utilitarianism, Virtue ethics. Ethical universalism: The view that the same standards of ethics apply to all people at all times. Ethics (or morals): (1) Standards of conduct (or behavior) that distinguish between right/wrong, good/bad, etc. (2) The study of standards of conduct. Ethics, applied: The study of ethics in specific situations, professions, or institutions, e.g., medical ethics, research ethics, etc. Ethics, meta: The study of the meaning, truth, and justification of ethical statements. 
Ethics, normative versus descriptive: Normative ethics studies the standards of conduct and methods of reasoning that people ought to follow. Descriptive ethics studies the standards of conduct and reasoning processes that people in fact follow. Normative ethics seeks to prescribe and evaluate conduct, whereas descriptive ethics seeks to describe and explain conduct. Disciplines such as philosophy and religious studies take a normative approach to ethics, whereas sociology, anthropology, psychology, neuroscience, and evolutionary biology take a descriptive approach. Exculpatory language: Language in an informed consent form, contract, or other document intended to excuse a party from legal liability. Exempt research: Human subjects research that is exempted from review by an institutional review board. Some types of exempt research include research on existing human samples or data in which the researcher cannot readily identify individuals and anonymous surveys of individuals. Expedited review: In human subjects research, review of a study by the chair of an institutional review board (or designee) instead of by the full board. Expedited review may be conducted on new studies that pose minimal risks to subjects, for continuing review in which a study is no longer recruiting subjects, or on amendments to approved studies that make only minor changes. Exploitation: Taking unfair advantage of someone else. Expression of concern: A journal may publish an expression of concern when a paper has come under suspicion for wrongdoing or is being investigated for possible research misconduct. Fabrication: Making up data or results.
Chapter 15 DNA and religion
Falsification: Deceptively changing, omitting, or manipulating data or results, or deceptively manipulating research materials or experiments.
Food and Drug Administration (FDA): A federal agency in charge of approving the marketing of drugs, biologics, medical devices, cosmetics, and food additives. The FDA has adopted human subjects research regulations that are similar to the Common Rule; however, the FDA rules do not allow exceptions from informed consent requirements unless a study qualifies as emergency research.
Fraud: Knowingly misrepresenting the truth or concealing a material (or relevant) fact to induce someone to make a decision to his or her detriment. Some forms of research misconduct may also qualify as fraud. A person who commits fraud may face civil or criminal legal liability.
Freedom of Information Act (FOIA): A law enacted in the United States and other countries that allows the public to obtain access to government documents, including documents related to government-funded scientific research, such as data, protocols, and emails. Several types of documents are exempt from FOIA requests, including classified research and confidential information pertaining to human subjects research.
Good clinical practices (GCPs): Rules and procedures for conducting clinical trials safely and rigorously.
Good laboratory practices (GLPs): Rules and procedures for designing and performing experiments or tests and recording and analyzing data rigorously. Some types of research are required by law to adhere to GLPs.
Good manufacturing practices (GMPs): Rules and procedures for manufacturing a product (such as a drug) according to standards of quality and consistency.
Good record-keeping practices (GRKPs): Rules and procedures for keeping research records. Records should be thorough, accurate, complete, organized, signed and dated, and backed up.
Guideline: A nonbinding recommendation for conduct.
Harassment: Repeatedly annoying, bothering, or intimidating someone.
Harassment, sexual: Harassment involving unwelcome sexual advances or remarks or requests for sexual favors.
Helsinki Declaration: Ethical guidelines for conducting medical research involving human subjects adopted by the World Medical Association.
Honesty: The ethical obligation to tell the truth and avoid deceiving others. In science, some types of dishonesty include data fabrication or falsification, and plagiarism.
Human subjects research: Research involving the collection, storage, or use of private data or biological samples from living individuals by means of interactions, interventions, surveys, or other research methods or procedures.
Incidental finding: Information inadvertently discovered during medical treatment or research, which was not intentionally sought. For example, if a research subject receives an MRI as part of a brain imaging study and the researcher notices an area in the frontal cortex that appears to be a tumor, this information would be an incidental finding.
Individualized research results: In human subjects research, results pertaining to a specific individual in a study, such as the subject's pulse, blood pressure, or the results of laboratory tests (e.g., blood sugar levels, blood cell counts, genetic or genomic variants). Individualized results may include intended findings or incidental findings. There is an ongoing ethical controversy concerning whether, when, and how individualized research results should be shared with human research subjects. Some argue that individualized results should be returned only if they are based on accurate and reliable tests and have clinical utility, because inaccurate, unreliable, or uncertain results may be harmful. Others claim that the principle of autonomy implies that subjects should be able to decide whether to receive their results.
Informed consent: The process of making a free and informed decision (such as to participate in research). Individuals who provide informed consent must be legally competent and have enough decision-making capacity to consent to research. Research regulations specify the types of information that must be disclosed to the subject. See also Assent.
Informed consent, blanket (general): A provision in an informed consent document that gives general permission to researchers to use the subject's data or samples for various purposes and share them with other researchers.
Informed consent, documentation: A record (such as a form) used to document the process of consent. Research regulations require that consent be documented; however, an institutional review board may decide to waive documentation of consent if the research is minimal risk and (1) the principal risk of the study is breach of confidentiality and the only record linking the subject to the study is the consent form or (2) the research involves procedures that normally do not require written consent outside of the research context.
Informed consent, specific: A provision in an informed consent document that requires researchers to obtain specific permission from the subject prior to using samples or data for purposes other than those that are part of the study or sharing them with other researchers.
Informed consent, tiered: Provisions in an informed consent document that give the subject various options concerning the use and sharing of samples or data. Options may include blanket consent, specific consent, and other choices.
Informed consent, waiver: In human subjects research, the decision by an institutional review board to waive (or set aside) some or all of the informed consent requirements.
Waivers are not usually granted unless they are necessary to conduct the research and the research poses minimal risks to the subjects.
Institutional animal care and use committee (IACUC): A committee responsible for reviewing and overseeing animal research conducted at an institution. IACUCs usually include members from different backgrounds and disciplines, with institutional and outside members, scientists, and nonscientists.
Institutional review board (IRB): A committee responsible for reviewing and overseeing human subjects research. An IRB may also be called a research ethics committee (REC) or research ethics board (REB). IRBs usually include members from different backgrounds and disciplines, with institutional and outside members, scientists, and nonscientists.
Intellectual property: Legally recognized property pertaining to the products of intellectual activity, such as creative works or inventions. Forms of intellectual property include copyrights on creative works and patents on inventions.
Justice: (1) Treating people fairly. (2) An ethical principle that obligates one to treat people fairly. Distributive justice refers to allocating benefits and harms fairly; procedural justice refers to using fair processes to make decisions that affect people; formal justice refers to treating similar cases in the same way. In human subjects research, the principle of justice implies that subjects should be selected equitably. See also Belmont Report.
Kantianism: An ethical theory developed by German philosopher Immanuel Kant (1724–1804), which holds that the right thing to do is to perform one's duty for duty's sake. One's duty is defined by an ethical principle known as the categorical imperative (CI). According to one version of the CI, one should act according to a maxim that could become a rule for all people. According to another version, one should always treat people as having inherent moral value (or dignity) and never only as objects or things to be used to achieve some end.
Law: A rule enforced by the coercive power of the government. Laws may include statutes drafted by legislative bodies (such as Congress), regulations developed and implemented by government agencies, and legal precedents established by courts, i.e., common law.
Legally authorized representative (LAR): A person, such as a guardian, parent of a minor child, healthcare agent, or close relative, who is legally authorized to make decisions for another person when they cannot make decisions for themselves. LARs may also be called surrogate decision-makers. See Competence, Decision-making capacity.
Material transfer agreement (MTA): An agreement between institutions for the transfer and use of research materials, such as cells or reagents.
Media embargo: A policy, adopted by some journals, which allows journalists to have access to a scientific paper prior to publication, provided that they agree not to publicly disclose the contents of the paper until it is published. Some journals will refuse to publish papers that have already appeared in the media.
Mentor: Someone who provides education, training, guidance, critical feedback, or emotional support to a student. In science, a mentor may be the student's advisor but need not be.
Minimal risk: A risk that is not greater than the risk of routine medical or psychological tests or exams or the risk ordinarily encountered in daily life activities.
Misconduct: See Research misconduct.
Mismanagement of funds: Spending research funds wastefully or illegally; for example, using grant funds allocated for equipment to pay for travel to a conference. Some types of mismanagement may also constitute fraud or embezzlement.
Morality: See Ethics.
National Science Foundation (NSF), Office of Inspector General (OIG): An NSF office that oversees the integrity of NSF-funded research. The OIG reviews reports of research misconduct inquiries and investigations conducted by institutions, as well as investigations of other problems, such as mismanagement of funds.
Nazi research on human subjects: Heinous experiments conducted on concentration camp prisoners, without their consent, during World War II. Many of the subjects died or received painful and disabling injuries. Experiments included wounding prisoners to study healing; infecting prisoners with diseases to test vaccines; and subjecting prisoners to electrical currents, radiation, and extremes of temperature or pressure.
Negligence: A failure to follow the standard of care that results in harm to a person or organization. In science, research that is sloppy, careless, or poorly planned or executed may be considered negligent.
Noncompliance: The failure to comply with research regulations, institutional policies, or ethical standards. Serious or continuing noncompliance in human subjects research should be promptly reported to the institutional review board and other authorities. See Compliance.
Nuremberg Code: The first international ethics code for human subjects research, set out during the Nuremberg war crimes tribunals in 1947. The code was used as a basis for convicting Nazi physicians and scientists of war crimes related to their experiments on concentration camp prisoners.
Objectivity: (1) The tendency for the results of scientific research to be free from bias. (2) An ethical and epistemological principle instructing one to take steps to minimize or control for bias.
Observer (or Hawthorne) effect: The tendency for individuals to change their behavior when they know they are being observed. Some social science experiments use deception to control for the observer effect.
Openness: The ethical obligation to share the results of scientific research, including data and methods.
Office of Human Research Protections (OHRP): A federal agency that oversees human subjects research funded by the Department of Health and Human Services, including research funded by the National Institutes of Health. The OHRP publishes guidance documents for interpreting the Common Rule, sponsors educational activities, and takes steps to ensure compliance with federal regulations, including auditing research and issuing letters to institutions concerning noncompliance.
Office of Research Integrity (ORI): A US federal agency that oversees the integrity of research funded by the Public Health Service, including research funded by the National Institutes of Health. The ORI sponsors research and education on research integrity and reviews reports of research misconduct inquiries and investigations from institutions.
Patent: A right, granted by a government, which allows the patent holder to exclude others from making, using, or commercializing an invention for a period of time, typically 20 years. To be patented, an invention must be novel, nonobvious, and useful. The patent holder must publicly disclose how to make and use the invention in the patent application.
Paternalism: Restricting a person's decision-making for their own good.
In soft paternalism, one restricts the choices made by someone who has a compromised ability to make decisions (see Decision-making capacity); in hard paternalism, one restricts the choices made by someone who is fully autonomous (see Autonomy).
Peer review: The process of using experts within a scientific or academic discipline (or peers) to evaluate articles submitted for publication, grant proposals, or other materials.
Peer review, double-blind: A peer-review process in which neither the authors nor the reviewers are told each other's identities.
Peer review, open: A peer-review process in which the authors and reviewers are told each other's identities.
Peer review, single-blind: A peer-review process, used by most scientific journals, in which the reviewers are told the identities of the authors but not vice versa.
Placebo: A biologically or chemically inactive substance or intervention given to a research subject which is used to control for the placebo effect.
Placebo effect: A person's psychosomatic response to the belief that they are receiving an effective treatment. Researchers may also be susceptible to the placebo effect if they treat subjects differently who they believe are receiving effective treatment. See also Double-blinding.
Plagiarism: Misrepresenting someone else's creative work (e.g., words, methods, pictures, ideas, or data) as one's own. See also Research misconduct.
Plagiarism, self: Reusing one's own work without proper attribution or citation. Some people do not view self-plagiarism as a form of plagiarism because it does not involve intellectual theft.
Politics: (1) Activities associated with governance of a country. (2) The science or art of government. (3) The study of government.
Precautionary principle (PP): An approach to decision-making that holds that we should take reasonable measures to prevent, minimize, or mitigate harms that are plausible and serious. Some countries have used the PP to make decisions concerning environmental protection or technology development. See also Risk/benefit analysis, Risk management.
Preponderance of evidence: In the law, a standard of proof in which a claim is proven if the evidence shows that it is more likely true than false (i.e., probability >50%). Preponderance of evidence is the legal standard generally used in research misconduct cases. This standard is much lower than the standard used in criminal cases, i.e., proof beyond reasonable doubt.
Privacy: A state of being free from unwanted intrusion into one's personal space, private information, or personal affairs. See also Confidentiality.
Proprietary research: Research that a private company owns and keeps secret.
Protocol: A set of steps, methods, or procedures for performing an activity, such as a scientific experiment.
Protocol, deviation: A departure from a protocol. In human subjects research, serious or continuing deviations from approved protocols should be promptly reported to the institutional review board.
Publication: The public dissemination of information. In science, publication may occur in journals or books, in print or electronically. Abstracts presented at scientific meetings are generally considered to be a form of publication.
Publication bias: Biases related to the tendency to publish or not to publish certain types of research. For example, some studies have documented a bias toward publishing positive results.
Quality control/quality assurance: Processes for planning, conducting, monitoring, overseeing, and auditing an activity (such as research) to ensure that it meets appropriate standards of quality.
Questionable research practices (QRPs): Research practices that are regarded by many as unethical but are not considered to be research misconduct. Duplicate publication and honorary authorship are considered by many to be QRPs.
Randomization: A process for randomly assigning subjects to different treatment groups in a clinical trial or other biomedical experiment.
Randomized controlled trial (RCT): An experiment, such as a clinical trial, in which subjects are randomly assigned to receive an experimental intervention or a control.
Regulation: (1) A type of law developed and implemented by a government agency. (2) The process of regulating or controlling some activity.
Reliance agreement: An agreement between two institutions in which one institution agrees to oversee human subjects research for the other institution for a particular study or group of studies.
Remuneration: In human subjects research, providing financial compensation to subjects.
Reproducibility: The ability for an independent researcher to achieve the same results of an experiment, test, or study under the same conditions. A research paper should include information necessary for other scientists to reproduce the results. Reproducibility is different from repeatability, in which researchers repeat their own experiments to verify the results. Reproducibility is one of the hallmarks of good science.
Research: A systematic attempt to develop new knowledge.
Research compliance: See Compliance.
Research ethics: (1) Ethical conduct in research. (2) The study of ethical conduct in research. See Responsible conduct of research.
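The Randomization and Randomized controlled trial entries above describe random assignment of subjects to study arms. As an illustrative sketch only (the function name, arm labels, and round-robin balancing scheme are assumptions introduced here, not part of this glossary), such an assignment could be simulated as:

```python
import random

def randomize(subject_ids, arms=("treatment", "control"), seed=None):
    """Randomly assign each subject to one arm, keeping group sizes balanced."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    ids = list(subject_ids)
    rng.shuffle(ids)
    # Deal the shuffled subjects round-robin into the arms for balanced sizes.
    return {arm: ids[i::len(arms)] for i, arm in enumerate(arms)}

# Assign eight (hypothetical) subject IDs to two arms of four each.
groups = randomize(range(1, 9), seed=42)
```

Because every permutation of subjects is equally likely after the shuffle, each subject has the same chance of landing in either arm, which is the point of randomization: it prevents systematic differences between groups.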
Research integrity: Following ethical standards in the conduct of research. See Research ethics.
Research institution: An institution, such as a university or government or private laboratory, which is involved in conducting research.
Research integrity official (RIO): An administrator at a research institution who is responsible for responding to reports of suspected research misconduct.
Research misconduct: Intentional, knowing, or reckless behavior in research that is widely viewed as highly unethical and often illegal. Most definitions define research misconduct as fabrication or falsification of data or plagiarism, and some include other behaviors in the definition, such as interfering with a misconduct investigation, significant violations of human research regulations, or serious deviations from commonly accepted practices. Honest errors and scientific disputes are not regarded as misconduct.
Research misconduct, inquiry versus investigation: If suspected research misconduct is reported at an institution, the research integrity official may appoint an inquiry committee to determine whether there is sufficient evidence to conduct an investigation. If the committee determines that there is sufficient evidence, an investigative committee will be appointed to gather evidence and interview witnesses. The investigative committee will determine whether there is sufficient evidence to prove misconduct and make a recommendation concerning adjudication of the case to the research integrity official.
Research sponsor: An organization, such as a government agency or private company, which funds research.
Research subject (also called research participant): A living individual who is the subject of an experiment or study involving the collection of the individual's private data or biological samples (see also Human subjects research).
Respect for persons: A moral principle, with roots in Kantian philosophy, which holds that we should respect the choices of autonomous decision-makers (see Autonomy, Decision-making capacity) and that we should protect the interests of those who have diminished autonomy (see Vulnerable subject). See also Belmont Report.
Responsible conduct of research (RCR): Following ethical and scientific standards and legal and institutional rules in the conduct of research. See also Research ethics, Research integrity.
Retraction: Withdrawing or removing a published paper from the research record because the data or results have subsequently been found to be unreliable or because the paper involves research misconduct. Journals publish retraction notices and identify retracted papers in electronic databases to alert the scientific community to problems with the paper. See Correction.
Right: A legal or moral entitlement. Rights generally imply duties or obligations. For example, if A has a right not to be killed, then B has a duty not to kill A.
Risk: The product of the probability and magnitude (or severity) of a potential harm.
Risk/benefit analysis: A process for determining an acceptable level of risk, given the potential benefits of an activity or technology. See also Risk management, Precautionary principle.
Risk management: The process of identifying, assessing, and deciding how best to deal with the risks of an activity, policy, or technology. See also Precautionary principle.
Risk minimization: In human subjects research, the ethical and legal principle that the risks to the subjects should be minimized using appropriate methods, procedures (such as subject selection rules), or other safety measures (such as a data and safety monitoring board).
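The Risk entry above defines risk as the product of a harm's probability and its magnitude. A minimal worked example (the `risk` helper and the 0–100 severity scale are hypothetical, introduced only to illustrate the arithmetic):

```python
def risk(probability, magnitude):
    # Risk = probability of the harm occurring x magnitude (severity) of the harm.
    return probability * magnitude

# On this definition, a rare-but-severe harm and a common-but-mild harm
# can carry the same risk score (severity on a hypothetical 0-100 scale):
rare_severe = risk(0.001, 80)  # 1-in-1000 chance of a severity-80 harm
common_mild = risk(0.01, 8)    # 1-in-100 chance of a severity-8 harm
```

This is why risk/benefit analysis weighs both dimensions: lowering either the probability (e.g., through subject selection rules) or the magnitude (e.g., through safety monitoring) reduces the risk.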
Risks, reasonable: In human subjects research, the ethical and legal principle that the risks to the subjects should be reasonable in relation to the benefits to the subjects or society. See Risk/benefit analysis, Social value.
Salami science: Dividing a scientific project into the smallest papers that can be published (the least publishable unit) to maximize the total number of publications from the project. See Questionable research practices.
Scientific (or academic) freedom: The institutional and government obligation to refrain from interfering in the conduct or publication of research, or the teaching and discussion of scientific ideas. See Censorship.
Scientific validity (or rigor): Processes, procedures, and methods used to ensure that a study is well designed to test a hypothesis or theory.
Self-deception: In science, deceiving one's self in the conduct of research. Self-deception is a form of bias that may be intentional or unintentional (subconscious).
Self-regulation: Regulation of an activity by individuals involved in that activity as opposed to regulation by the government. See also Law.
Singapore Statement: An international research ethics code developed at the 2nd World Conference on Research Integrity in Singapore in 2010.
Social responsibility: In science, the obligation to avoid harmful societal consequences from one's research and to promote good ones.
Social value: (1) The social benefits expected to be gained from a scientific study, such as new knowledge or the development of a medical treatment or other technology. (2) The ethical principle that human subjects research should be expected to yield valuable results for society.
Speciesism: The idea, defended by philosopher Peter Singer, that treating human beings as morally different from animals is a form of discrimination like racism. Singer argues that since all animals deserve equal moral consideration, most forms of animal experimentation are unethical. See Value, scale of.
Standard operating procedures (SOPs): Rules and procedures for performing an activity, such as conducting or reviewing research.
Statistical significance: A measure of the degree to which an observed result (such as a relationship between two variables) is due to chance. Statistical significance is usually expressed as a P-value. A P-value of 0.05, for example, means that the observed result would occur as a result of chance alone only 5% of the time.
Subject selection: Rules for including/excluding human subjects in research. Subject selection should be equitable, i.e., subjects should be included or excluded for legitimate scientific or ethical reasons. For example, a clinical trial might exclude subjects who do not have the disease under investigation or are too sick to take part in the study safely. See Risk minimization, Justice.
Surrogate decision-maker: See Legally authorized representative.
Testability: The ability to test a hypothesis or theory. Scientific hypotheses and theories should be testable.
Therapeutic misconception: (1) The tendency for human subjects in clinical research to believe that the study is designed to benefit them personally. (2) The tendency for the subjects of clinical research to overestimate the benefits of research and underestimate the risks.
Three R's: Ethical guidelines for protecting animal welfare in research, including reduction (reducing the number of animals used in research), replacement (replacing higher species with lower ones or animals with cells or computer models), and refinement (refining research methods to minimize pain and suffering).
Transparency: In science, openly disclosing information that concerned parties would want to know, such as financial interests or methodological assumptions. See also Conflict of interest, management.
Tuskegee Syphilis Study: A study, sponsored by the US Department of Health, Education, and Welfare, conducted in Tuskegee, Alabama, from 1932 to 1972, which involved observing the progression of untreated syphilis in African American men. The men were not told they were in a research study; they thought they were getting treatment for "bad blood." Researchers also steered them away from clinics where they could receive penicillin when it became available as a treatment for syphilis in the 1940s.
Unanticipated problem (UP): An unexpected problem that occurs in human subjects research. Serious UPs that are related to research and suggest a greater risk of harm to subjects or others should be promptly reported to institutional review boards and other authorities.
Undue influence: Taking advantage of someone's vulnerability to convince them to make a decision.
Utilitarianism: An ethical theory that holds that the right thing to do is to produce the greatest balance of good/bad consequences for the greatest number of people. Act utilitarianism focuses on the good resulting from particular actions, whereas rule utilitarianism focuses on the happiness resulting from following rules. Utilitarianism may equate the good with happiness, satisfaction of preferences, or some other desirable outcomes. See also Consequentialism, Ethical theory.
Value: Something that is worth having or desiring, such as happiness, knowledge, justice, or virtue.
Value, conflict: An ethical dilemma involving a conflict among different values.
Value, instrumental: Something that is valuable for the sake of achieving something else, e.g., a visit to the dentist is valuable for dental health.
Value, intrinsic: Something that is valuable for its own sake, e.g., happiness, human life.
Value, scale of: The idea that some things can be ranked on a scale of moral value. For example, one might hold that human beings are more valuable than other sentient animals; sentient animals are more valuable than nonsentient animals, etc. Some defenders of animal experimentation argue that harming animals in research can be justified to benefit human beings because human beings are more valuable than animals.
Virtue: A morally good or desirable character trait, such as honesty, courage, compassion, modesty, and fairness.
Virtue ethics: An ethical theory that emphasizes developing virtue as opposed to following rules or maximizing good/bad consequences.
Voluntariness: The ability to make a free (uncoerced) choice. See Coercion, Informed consent.
Vulnerable subject: A research subject who has an increased susceptibility to harm or exploitation due to his or her compromised ability to make decisions or advocate for his or her interests or his or her dependency. Vulnerability may be based on age, mental disability, institutionalization, language barriers, socioeconomic deprivation, or other factors. See Decision-making capacity, Informed consent.
Whistleblower: A person who reports suspected illegal or unethical activity, such as research misconduct or noncompliance with human subjects or animal regulations. Various laws and institutional policies protect whistleblowers from retaliation.
Withdrawal: Removing a human research subject from a study. Subjects may voluntarily withdraw or be withdrawn by the researcher to protect them from harm or to ensure the integrity of the study. Subjects who withdraw from a study may request to have their samples removed from the study (i.e., destroyed).
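The Statistical significance entry above describes a P-value as the probability that chance alone would produce a result at least as extreme as the one observed. That idea can be made concrete by simulation (a hedged sketch: the coin-flip scenario, function name, and parameters are illustrative assumptions, not taken from this glossary):

```python
import random

def simulated_p_value(observed_heads, n_flips=100, trials=20_000, seed=1):
    """Estimate the one-sided P-value: the fraction of chance-only trials
    in which a fair coin yields at least `observed_heads` heads in `n_flips`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            hits += 1
    return hits / trials

# Observing 60 heads in 100 flips of a supposedly fair coin: the exact
# one-sided P-value is about 0.028, below the conventional 0.05 threshold,
# so the result would usually be called statistically significant.
p = simulated_p_value(60)
```

The simulation replaces the formal binomial calculation with brute-force repetition of the chance-only experiment; with more trials the estimate converges on the exact P-value.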
Suggested readings
Advances in DNA storage: https://twistbioscience.com/company/blog/twistbiosciencedatastorageram.
Advances in Macromolecular Storage: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/9201/1/Advances-in-macromolecular-data-storage/10.1117/12.2060549.short?SSO=1.
Allah's Miracles of the Holy Qur'an: https://www.goodreads.com/author/quotes/164170.Harun_Yahya?page=2.
DNA and Darwin: https://www.newscientist.com/article/mg23130880-400-the-odd-couple-how-evolution-and-genetics-finally-got-together/.
DNA testing, science vs religion: https://www.easy-dna.com/knowledge-base/dna-testing-science-vs-religion/#.
Ethics and Religion Talks: https://therapidian.org/ethics-and-religion-talk-dna-privacy-and-misuse-information.
Embryonic development: L. Wolpert, The Triumph of the Embryo, Oxford University Press, 1991; also discussed in detail, with original pictures by Haeckel, in D. Bainbridge, Making Babies, Harvard University Press, 2001.
God is the creator: https://factsandtrends.net/2018/08/28/god-is-the-creator-highly-religious-less-likely-to-support-genetic-engineering/.
Mayes, G.R., DNA of a Revolution: The Small Group Experience: Dream Together about the Church that Could Be and Unleash the Adventure of Going There Together, Long Wake Publishing, first ed., May 28, 2015.
Islamic Supreme Council of America: http://www.islamicsupremecouncil.org/publications/articles/55-the-qurans-miraculous-relevance.html.
Gordon, M.V., Tainted DNA: Pursuing Our True Spiritual Design (Tainted DNA Series), Snow Owl Publishing, May 29, 2019.
Meyer, S.C., June 6, 2009. Signature in the Cell: DNA and the Evidence for Intelligent Design, reprint edition. HarperOne.
Meyer, S.C., June 28, 2015. Debating Darwin's Doubt: A Scientific Controversy that Can No Longer Be Denied, first ed. Discovery Institute Press.
New York Times, DNA topic: https://www.nytimes.com/topic/subject/dna.
Religion and DNA: https://www.americamagazine.org/faith/2018/02/01/how-your-dna-points-existence-and-intricacy-god.
Religion versus genetics: https://www.patheos.com/blogs/religioustolerance/2018/01/genetics-vs-religion/.
Some Laws and Principles in Evolutionary Biology (from the Library of MERIT CyberSecurity).
We are on the cusp of the gene revolution: https://www.newscientist.com/article/mg23130842-900-were-on-the-cusp-of-a-gene-editing-revolution-are-we-ready/.
www.termanini.com.
Index
Note: Page numbers followed by "f" indicate figures and "t" indicates tables.
A
Advanced encryption standard (AES), 319–320
Allen's rule, 426
Amazon's flying warehouses, 156
Amino acid, 22–24
Anomie, 345–346
Antivirus technologies (AVTs), 43, 80, 299–300
"Apocalypse-proof", 6
Artificial gene synthesis, 155
Artificial intelligence, 124–126, 197–198
ASCII, 112–116
A-train, 236
Audio media features, 185–186
Autonomic adapter, 66
Autonomic self-regulating process, 42
B Bachand DNA story, 202e204 Bancroft encoding method, 142 Bateman’s principle, 426 Beijernek’s principle, 426 Bergmann’s rule, 426 Binary code, 3f, 183e184 Binary coded-decimal (BCD), 114 Binary coding/decoding academic research Columbia University, 267 Harvard University, 267 Swiss Federal Institute of Technology (ETH), Zurich, 268 University of Illinois, UrbanaeChampaign, 267e268 University of Washington, 267 address, 273 binary, 270e273, 271fe272f binary convertion, 282e283 binary malware, inject into DNA, 280e282 BioEdit software, 276 Blockchain malware, 283 catalog, 269 central dogma of genetics, 265e266, 266f Columbia University, 267 dark web, 280 deep web, 279e280 Defense Advanced Research Projects Agency’s (DARPA’s), 269 DNA malware, 280e283
  DNA storage, 267
  DNA synthesis/sequencing, 267
  dynamic equilibrium, 261–263
  European Bioinformatics Institute, 270
  foreign research, 270
  grasshoppers rainstorm, 283
  Harvard University, 267
  Helixworks, 269
  industry, 268
  infected DNA, 282–283
  Intelligence Advanced Research Projects Activity, 269–270
  Iridia, 269
  malware technology, 278–280, 280f
  Micron technology, 268
  Microsoft research, 268
  molecular information storage (MIST) story, 275–276
  Molecular Information Systems Lab (MISL), 267
  National Institutes of Health (NIH), 270
  National Science Foundation (NSF), 267, 270
  next-generation sequencing (NGS), 277–278
  payload, 273
  polymerase chain reaction (PCR), 273
    copying DNA sequences, 273–276
    primers, 273
  random access, 273
  research consortium, 268
  Semiconductor Research Corporation (SRC), 268
  sequence editing, 276
  start-ups, 269
  structure DNA, 264–265
  surface web, 279
  Swiss Federal Institute of Technology (ETH), Zurich, 268
  TA1 (storage), 275
  TA2 (retrieval), 275
  TA3 (operating system), 275
  two-way malware, 280
  University of Illinois, Urbana–Champaign, 267–268
Binary conversion, 282–283
Binary malware, inject into DNA, 280–282
Binary to DNA-encoded data, 187–193
Biochemistry-based information technology, 382–389
BioEdit software, 276
Biohacking, 242–243
Bioinformatics, 183, 200–223
Biological clock, 393
Biological determinism, 346
Biological profiling, 23–24
BLAST software, 182, 201
  architecture, 182–183
Blockchain malware, 283
Blockchains (BCs), 234
  A-train, 236
  competition–Ethereum, 238
  disadvantages, 238–240
  malware, 241
Buffer overflow
  launch pad, 194
  Sherlock software, 194
  spilled data, 193
Bulmer effect, 426
C
Cannabis dilemma, 245
Causality, 52–55
  reasoning engine, 55–56, 56f
Central coordination center (CCC), 48, 298–300
Central dogma, 18
  genetics, 19, 265–266, 266f
Chromosomes, 20
Church's DNA storage, 156
Client system, 66
Cloud data center, 108
Clustered regularly interspaced short palindromic repeats (CRISPR), 25–27, 153, 195–196, 415–418
  ASCII, 112–116
  data storage capacity, 102
  digital universe, 98–102
  DNA, 97
    data storage in, 118–119
    digital storage, 109–111
    genetic code to DNA binary code, 111
    hack, 120–122
  Dubai
    embracing DNA storage, 105–106
    magical smart city, 103
  five types of data centers, 107–108
  gopher message, 117–118
  Holy Grail of data deluge, 126
  Huffman compression rule, 117
  hyper data center of world, 106–107
  hyperscale data centers, 108–109
  magic, 119–120
  smart cleaver, 122–123
Coding malware, 192–193
Coefficient of relatedness, 426
Cognitive/artificial intelligence (AI) systems, 101
Cognitive early warning predictive system (CEWPS), 298
  anatomical composition of
    causality, 54
    causality reasoning engine, 55–56, 56f
    central coordination center, 48
    cybercrime raw data distillation process, 52–54
    experience, 49–51
    knowledge, 49, 51
    prediction, 55
    reasoning engine, 54
    reverse engineering center, 57
    six stages of cybercrime episode, 51–52
    smart city critical infrastructure, 57
    Smart Vaccine center, 62
    vaccine knowledge base, 63
    virus knowledge base, 63
  intelligent smart shield, 45
  and its intelligent components, 47
  smart Nanogrid, 64
Colocation data center, 108
Columbia University, 267
Combined DNA Identification System (CODIS), 422
Compact disc (CD), 185–186
Competition–Ethereum, 238
“Contiguous” chemical building blocks, 20
Cope's law, 426
Cope's rule, 426
Creative mind of hacker, 194–195
Crime-as-a-service (CaaS), 84
Criminal minds, 349
CRISPR-associated protein (CAS) enzymes, 26
CRISPR-Cas9 molecules, 83, 83f, 124–126
Critical infrastructures, 59–64
  in smart cities, 57–59
Cryptology glossary, 321–322
Cybercrime
  biochemistry-based information technology, 382–389
  cyberterrorism, 373f, 374–377
  data repositories, 368–369
  digital storage, 370–371
  DNA computing (DNAC), 377, 381
  episode, 51–52
    six stages of, 51–52
  eradication, 349
  Human Cell Atlas (HCA), 379–381
  National Medical DNA, 378
  raw data distillation process, 52–54
  smart city, 371–374, 372f
  Smart City Citizens Database, 378
  Smart City Citizen Violence and Crime Database, 378
  Smart City DNA Operations Storage and Archive, 378
  storage demand, 369–370
  storage supply, 369–370
Cybercriminal behavior, 365
Cybersecurity, 120
Cyberterrorism, 366
D
“Dark Lady of DNA”, 8
Dark web, 280
Databroker (DAO), 325–327
Datacenter power consumption, 353
Data encoding, 142, 146
Data encryption standards (DES), 318–320
Data nomenclature, 140–141
Data repositories, 368–369
Data storage capacity, 102
Debuggers, 302
DeepLocker, 197–198
Deep neural networks, 204–205, 206
Deep web, 279–280
Defense Advanced Research Projects Agency (DARPA), 269
Density, 329
Deoxyribonucleic acid, 2
Digital DNA sequencing, 233
Digital immunity ecosystem (DIE), 39–41
  advanced encryption standard (AES), 319–320
  anatomical composition, 299–308
  anatomy, 298
  antivirus technology (AVT), 299–300
  causality reasoning and predictor, 302–303
  central coordination center (CCC), 298–300
  CEWPS
    anatomical composition of, 48–57
    intelligent smart shield, 45
    and its intelligent components, 47
  Cognitive Early Warning Predictive System (CEWPS), 298
    smart nanogrid, 308, 308f
  critical infrastructures, 59–64
    in smart cities, 57–59
  cryptology glossary, 321–322
  data encryption standards (DES), 318–320
  debuggers, 302
  disassemblers, 301
  DNA component, 309–310
  DNA computer, 314–315
  DNA computing, 313–314
    applications, 314
  DNA cryptology, 311
  DNA digital storage, 297, 297f–298f
  DNA encoding, 320–321
  3D nanoattack scenario, 46–47
  Dubai digital data forecast, 315–317, 316f
  encryption algorithm, 312, 312f
  hex editors, 302
  intelligent components, 298
  knowledge acquisition component, 300, 301f
  machine learning component, 302–303
  one-time pad (OTP), 317–318
  plague of Athens, 430 BC, 296–297, 297f
  Plague, Siege of Caffa, 1346, 295–296
  portable executive, 302
  resource viewer, 302
  reverse engineering center, 301–302
  smart cities, 42–45
  smart city critical infrastructures, 303, 303f–304f
  smart grid model, 65–68, 308–311
  smart vaccine, 42
  smart vaccine center, 305, 305f–306f
  time, 311
  vaccine knowledge base (VaKB), 301, 306
  virus knowledge base (ViKB), 301, 307, 307f
Digital storage, 370–371
Digital universe, 2, 4, 98–102
Digital video, 186–187
DNA, 97
  an organic data castle, 6
  attack on computer, 199–200
  code of life, 2–3
  Columbus discovery, 3–4
  computer, quantum computer vs., 191–192
  crime, 84–85
  cyber hacker, 196–197
  data storage, 346, 353
  data storage in, 118–119
  digital data hacking, 84
  digital storage, 109–111
  discovered, 9
  fingerprint, 421
  Fountain, 178
  genetic code to DNA binary code, 111
  hack, 120–122
  hacking
    with nanorobots, 199
    next generation of, 195
  malware, 280–283
  malware-as-a-service, 198–199
  music, 8
  pioneers, 4–6
  sequencing, 114, 267
  synthesis, 185, 267
  writer, 173–174
DNA computing (DNAC), 184, 381
DNA-encoded data
  Adleman discovery, 180–181
  artificial intelligence–powered malware, 197–198
DNA-encoded data (continued)
  Bachand DNA story, 202–204
  binary code, 183–184
  binary to, 187–193
  bioinformatics, 200–223
  BLAST algorithm software, 182
  BLAST software architecture, 182–183
  buffer overflow
    launch pad, 194
    Sherlock software, 194
    spilled data, 193
  computing, 180
  creative mind of hacker, 194–195
  CRISPR, 195–196
  deep neural networks, 204–205
  digital video, 186–187
  DNA attack, computer, 199–200
  DNA cyber hacker, 196–197
  DNA hacking with nanorobots, 199
  DNA writer, 173–174
  FASTA software system, 183
  FASTQ software, 202
  fountain software
    architecture, 177–178
    strategy, 174–177
  IBM's DeepLocker, 201–202
  Jian-Jun Shu discovery, 181–182
  nanopowered malware, 198–199
  next generation of DNA hacking, 195
  nondeterministic universal turing machine, 185
  reliable and efficient DNA storage architecture, 178–179
  Sherlock detective software, 206
  video and audio media features, 185–186
DNA-encrypted data
  Marlene and George Bachand pioneers of, 202–204
3D nanoattack scenario, 46–47
DNA storage, 139–140, 267
  Amazon's flying warehouses, 156
  artificial gene synthesis, 155
  Bancroft encoding method, 142
  Church's DNA storage, 156
  computing, 156–157
  Databroker (DAO), 325–327
  data nomenclature, 140–141
  DNA potentialities, 329–331, 331f
  encoding method, 146
  Goldman encoding method, 144, 145f, 151
  hardware to bioware, 327–329, 328f
  Huffman encoding method, 142–143
  information communication technology (ICT) infrastructure, 323–324
  Intelligence Advanced Research Projects Activity (IARPA), 324
  mathematical ideas, 139
  molecular information storage (MIST), 324–325
  new supercomputer, 157–158
  power usage effectiveness (PUE), 329
  with random access, 146
  random access method, 141–142
  random access retrieval, 397–399
  reliability vs. density, 150–152
  Rosetta stone, 153–154
  silicon, 154
  simulation method, 146–150
  smart data, 325–327
  tunable (balanced) redundancy method, 146
  XOR encoding method, 145–146
DNA stuxnet (DNAXNET), 82–84
Droplets, 174
Dubai digital data forecast, 315–317, 316f
DVD disk, 186
D-wave systems, 190–191
Dynamic equilibrium, 261–263
E
Encoding method, 146
Encryption algorithm, 312, 312f
Enterprise data center, 107
Enzyme, 25, 27
Epigenetics, 6
“Escape” sequence, 113
European Bioinformatics Institute, 270
Existing encoding methods
  Bancroft encoding method, 142
  encoding method, 146
  Goldman encoding method, 144, 145f, 151
  Huffman encoding method, 142–143
  tunable (balanced) redundancy method, 146
  XOR encoding method, 145–146
F
Facebook, 328
FASTA software system, 183
FASTQ software, 202
File recovery (back to binary) method, 148
Fisher's fundamental theorem, 426
Fisher's theorem of the sex ratio, 426
Foreign research, 270
Fountain software
  architecture, 177–178
  strategy, 174–177
G
Galton's regression law, 427
Gause's rule, 427
Gene editing, 24–25, 418
Gene hacking, 85–87
Genetic code, 20–22, 20f–21f
  properties of, 22–23
Genetic complexity, 346
Genetics, 2, 20
Genomes, 424
Goldman encoding method, 144, 145f, 151
Google, 328
Grasshoppers rainstorm, 283
H
Hack computers, 29–30
Hacking DNA genes, 79–80
  DNA stuxnet (DNAXNET), 82–84
  MyHeritage website leakage, 87–88
  new genetic editing tools
    DNA crime, 84–85
    DNA digital data hacking, 84
    gene hacking, 85–87
  Stuxnet, 80–82
    crack in door, 90–91
    damages, 91
    strategy of attack, 88
    two attack strategies, 89–90
Hacking nightmare, 28
Haeckel's biogenetic law, 427
Haldane's hypothesis, 427
Hamilton's altruism theory, 427
Hamilton's rule, 427
Hardware to bioware, 327–329, 328f
Hardy–Weinberg law, 427
Harvard University, 267
Helixworks, 269
Heritability, 427
Holy Grail of data deluge, 126
Huffman compression rule, 117
Huffman encoding method, 142–143
Human Cell Atlas (HCA), 228, 379–381
Human DNA system
  anatomy of, 18–19
  biological profiling, 23–24
  central dogma of genetics, 19
  CRISPR, 25–27
  fighting chaos, 18
  fingerprinting, 28
  gene editing, 24–25
  genetic code, 20–22, 20f–21f
    properties of, 22–23
  hack computers, 29–30
  hacking nightmare, 28
  “magical DNA”, 17
  warm place for music, 29
  yogurt story, 25
Human genes in bytes, 111
Hyperscale data centers, 108–109
I
IBM's DeepLocker, 201–202
Illumina method, 231
Infected DNA, 282–283
Information communication technology (ICT) infrastructure, 323–324
Information consumption, 99
Intelligence Advanced Research Projects Activity (IARPA), 269–270, 324
Intelligent smart shield, 45
Internet of things (IoT), 100, 323–324
Islam, 418
J
Jian-Jun Shu discovery, 181–182
K
Knowledge acquisition component, 49, 300, 301f
Knowledge-based system, 55
L
Launch pad, 194
Linux, 229
Longevity, 329
Long strings of nucleotides, 20
Loyalty, 329
Lyon hypothesis, 427
M
Mac OS X, 229
“Magical DNA”, 17
Malware technology, 240–241, 278–280, 280f
Managed service providers, 107
Mass killer, 348–349
Messenger RNA (mRNA), 247f
Micron technology, 268
Microsoft, 328
Microsoft research, 268
Molecular information storage (MIST), 324–325
Molecular program, 156
Moore's law, 188
Mortality, 393
mRNA, 22
Muller's ratchet, 427
MyHeritage website leakage, 87–88
N
Naive encoding method, 150
Nanopowered malware, 198–199
National Medical DNA, 378
National Television Systems Committee (NTSC), 187
Neo-Internet, 80
Neurons, 205
New genetic editing tools
  DNA crime, 84–85
  digital data hacking, 84
  gene hacking, 85–87
Nondeterministic universal turing machine, 185
Nucleotides, 19–20
O
Oligonucleotide, 155
One-time pad (OTP), 317–318
“Open source” program, 177
P
Paleoanthropology, 424
Parental investment theory, 427
Phase alternation line (PAL), 187
Polymerase, 22
Polymerase chain reaction (PCR), 140–141, 185, 392
Population genetics, 346
Portable executive, 302
Potentialities, DNA, 329–331, 331f
Poverty, 344–346, 345f
Power usage effectiveness (PUE), 329
Predestined program, 2
Programmable logic controller (PLC), 89
Protein, 18–19, 22
  clock hypothesis, 428
  synthesis, 22
Q
Quantum annealing, 190
Quantum computing, 188–190
  vs. DNA computer, 191–192
  entanglement process, 190
Quantum entanglement, 189
Quantum superposition, 189
Quantum tunneling, 189
R
Random access method, 141–142, 146
Rapid heuristic algorithms, 182
Reasoning engine, 54
Red queen theory, 428
Reliability vs. density, 150–152
Religion
  atheistic arrogance, 419
  bacteria, 414–417
  cold reception, 419–420
  Combined DNA Identification System (CODIS), 422
  CRISPR, 415–418
  DNA fingerprint, 421
  evolution, 423
  evolutionary biology, 426–428
  genes programming, 428–429
  God's domain, 418–419
  holy Quran, 428–430
  human future diary, 421–422
  information capacity, 430
  medical data, 421
  personal data, 421
  religious communities, 420–421
  science, 415
  scriptural DNA, 415
Religious communities, 420–421
Replication, 22
Resource viewer, 302
Reverse engineering center, 57, 301–302, 348–349
Ribosome, 173–174
S
Sample system performance, 68
Scriptural DNA, 415
Selection coefficient, 428
Selection differential (S), 428
Sequence-specific adaptive immunity, 25
Sequencing depth, 149
  reduced, 149
Sequencing DNA-encoded data
  biohacking, 242–243
  blockchains (BCs), 234
    A-train, 236
    competition–Ethereum, 238
    disadvantages, 238–240
    malware, 241
  cannabis dilemma, 245
  digital DNA sequencing, 233
  DNA drone hacking, 244
  DNA satellite hacking, 243
  DNA storage libraries, 233–234
  Human Cell Atlas (HCA) project, 228
  Illumina method, 231
  Linux, 229
  Mac OS X, 229
  malware, 240–241
  messenger RNA (mRNA), 247f
  operations, 231–232
  sinkholes, 237
  smart city ontology, 226–227
  sunrise, 236–237
  unit conversion table, 247–249
Séquentiel Couleur à Mémoire, or sequential color with memory (SECAM), 187
Sherlock software, 194, 206
Sinkholes, 237
Smart cities, 42–45
  critical infrastructure, 57
  critical infrastructures, 303, 303f–304f
  ecosystem, 42, 43f
  idealistic hype, 66–68
  ontology, 226–227
Smart data, 325–327
Smart grid model, 65–68, 308–311
Smart Nanogrid, 64
  critical systems to, 65–66
Smart vaccine center, 41–42, 62, 305, 305f–306f
Social crimes, 353
  anomie, 345–346
  biological determinism, 346
  cybercrime eradication, 349
  datacenter power consumption, 353
  definition, 344
  DNA data storage, 346, 353
  genetic complexity, 346
  mass killer, 348–349
  population genetics, 346
  poverty, 344–346, 345f
  reverse engineer DNA, 348–349
  smart cities, 349
  sources, 343–344
  street crime, 344
  subject-specific DNA library, 347t
  urban population, 350f
Social media data center, 108
Standard neural network (NN), 205
Storage demand, 369–370
Storage supply, 369–370
Street crime, 344
Stuxnet, 80–82
  crack in door, 90–91
  damages, 91
  strategy of attack, 88
  two attack strategies, 89–90
Subject-specific DNA library, 347t
Sunrise, 236–237
Superposition, 188–189
Supervisory Control and Data Acquisition (SCADA) system, 89
Sustainability, 329
Svalbard Global Data Vault, 399, 400f
Swiss Federal Institute of Technology (ETH), Zurich, 268
T
Tangled bank theory, 428
Telomere story, 393–394, 393f
Thermo Fisher, 147
Time machine
  amazing storage phenomenon, 397
  biological clock, 393
  DNA storage random access retrieval, 397–399
  mortality, 393
  polymerase chain reaction (PCR), 392
  prediction, 392
  storage evolution over time, 397
  Svalbard Global Data Vault, 399, 400f
  telomere story, 393–394, 393f
  time travel, 394–396
Time travel, 394–396
Transcription, 173–174
Translation, 173–174
Trivers–Willard hypothesis, 428
tRNA, 23
U
Unit conversion table, 247–249
University of Illinois, Urbana–Champaign, 267–268
University of Washington, 267
Urban population, 350f
V
Vaccine knowledge base (VaKB), 63, 306
von Baer's rule, 428
Vesicular MonoAmine Transporter 2 (VMAT2), 424–425
Video media features, 185–186
Virus knowledge base (ViKB), 63, 307, 307f
W
Weismann's hypothesis, 428
Word on word processing, 122–123
Wright–Fisher model, 428
X
XOR encoding method, 145–146