English · 228 pages · 2023
Julia for Data Science MEAP V03

Copyright 2023 Manning Publications
welcome
1 Introduction
2 Julia Programming: Data Types and Structures
3 Julia Programming: Conditionals, Loops and Functions
4 Importing Data
5 Data Analysis and Manipulation
6 Data Visualization
Appendix A. Setting up the Environment
Appendix B. Importing Data from Different Files
MEAP Edition
Manning Early Access Program
Julia for Data Science
Version 3

Copyright 2023 Manning Publications

©Manning Publications Co. We welcome reader comments about anything in the manuscript, other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders. https://livebook.manning.com/book/julia-for-data-science/discussion
For more information on this and other Manning titles go to manning.com
welcome

Welcome, and thank you for selecting Julia for Data Science as your guide to exploring the powerful world of Julia programming and its applications in data science. This book aims to provide you with a clear, practical, and engaging approach to mastering Julia programming and applying it to various data science challenges.

This book is suitable for anyone who is interested in data science and wants to learn Julia from scratch. Whether you are a student, an academic, or a professional, this book will help you acquire the necessary skills to analyze data and solve real-world problems using Julia. We assume that you possess some background in programming and basic knowledge of data science concepts. This book is designed to help you build on your existing expertise and make a smooth transition to the world of Julia programming.

The first three chapters introduce you to the core principles of Julia programming, beginning with the fundamentals and gradually moving to more advanced topics. As you gain confidence in Julia, the book will explore supervised learning, unsupervised learning, and deep learning algorithms using Julia. We'll not only utilize popular Julia packages for these purposes but also teach you how to develop these models from scratch whenever possible. Currently the first four chapters are published online. Our aim is to add one new chapter each month.

Throughout the book, you'll encounter real-life examples that demonstrate the versatility and power of Julia in the field of data science. We hope that Julia for Data Science becomes an invaluable tool in your quest to master Julia programming and its applications in data science. Your feedback is essential for making this book the best it can be, so we encourage you to share your thoughts, questions, and suggestions in the liveBook
discussion forum. Thank you once again for choosing Julia for Data Science, and we wish you an enjoyable and fruitful learning journey.

Warm regards,
—İlker Arslan

In this book

Copyright 2023 Manning Publications
welcome
brief contents
1 Introduction
2 Julia Programming: Data Types and Structures
3 Julia Programming: Conditionals, Loops and Functions
4 Importing Data
5 Data Analysis and Manipulation
6 Data Visualization
Appendix A. Setting up the Environment
Appendix B. Importing Data from Different Files
1 Introduction

This chapter covers

The need for Julia
Investing in Julia
Solving development and production problems with Julia

When you step into a new field or discipline that requires coding, it always comes down to choosing a programming language. There may be various reasons to choose a language: suitability to the field, ease of use, the learning path, the community, or simply the preference of your colleagues. Every language has its pros and cons. Some languages are very easy to learn and code in, some are very fast, and some are suited to special purposes. Some people use a language for years, and it is simply easier for them to continue with it. The same is true for companies, and it is much harder for a company to change. It may be challenging to replace a programming language that has been used in a company for years. Legacy systems may depend on one specific language, or the people in the company may be proficient in one language. In such cases, significant investments may be required to switch programming languages. In the last decade I have changed my language of preference from C++ and Matlab to R, then to Python, and finally to Julia. I decided to go with Julia for two reasons: ease of use like Python and speed like C++. I have observed that individuals and companies have undergone a progression similar to mine. Most analytics teams have switched to Python and R from Matlab, SAS, etc. If I may quote Andrew Ng, programming languages are just like the tools you use. A carpenter may use a traditional plane or a high-tech CNC machine depending on the requirements of the job. The same is true for programming languages: you have to choose the right tools to keep in your toolbox.
The most common type of question I've been asked over the last ten years is about choosing the right tool for statistics, data science, machine learning, and numerical computing. Recently, one of the most frequent questions has been, "Is it worth learning Julia?" The funny thing is that the people who ask this question are mostly the same ones who wondered whether R or Python were worth learning a few years ago. My observation is that many data science and analytics professionals are like me and hesitate to switch to Julia because they have already invested in Python, R, etc. My story is very similar. I had been aware of Julia since 2015, but at the beginning I was a bit cautious because it was a very new programming language. I didn't switch to Julia immediately, but I liked its features. The first thing that impressed me about Julia was the authenticity of its standard libraries. Languages like R and Python, and many well-known libraries within them, are written in low-level languages like C, C++, or Fortran. Although a large portion of Julia is also written in other languages like C, C++, Scheme, and Lisp, a lot of Julia is written in Julia. Most importantly, the standard libraries, including the Base library and its operations, are written in Julia. So what? Why is that important for us? If you start to learn R or Python and sail too far from the shore, beyond some point you'll need another language for performance. For example, if you want to develop fast R or Python packages, you have to use C or C++. That is not the case for Julia. Once you start coding in Julia, you can always rely on it. You can write your packages in Julia without worrying about the run time. Another point about Python and R is that most widely used packages in these languages have their own syntax, and you'll need to spend some time learning them. For example, in R there are two popular packages that are very nice for data analysis and manipulation: dplyr and data.table. The syntax of these packages is quite different from base R, and you have to spend time learning them after you learn R. Or imagine learning PyTorch or TensorFlow as a Python programmer. Again, you will need extra time and effort to learn them. At least that was the case for me. But that is generally not the case in Julia. Julia libraries for data science and machine learning are written in Julia, and using them is just like using base Julia. For example, the most widely used
deep learning library in Julia is Flux, and you don't have to spend extra effort to learn it. When I first read the Flux tutorial for deep learning, I understood what was going on by reading the code, without reading the explanations. Knowing Julia syntax was enough. The following quotes, from a discussion on the Julia Discourse forum, convey the idea very well:

"Though Julia is not a self-host language, in practice, especially in scientific research, a pitch I often give is: all the parts you would ever care about is in Julia and you can read, understand, and even tweak at will. I'm referring to the fact that base and stdlib are in Julia and packages are more often pure Julia (at least for the part that does heavy lifting). Unlike, for example, Numpy, where one hits C code super quickly once start digging (even function as simple as mean is written in C)." (https://discourse.julialang.org/t/how-is-julia-written-in-julia/50918/8)

"I've definitely solved a few problems unrelated to Julia by looking at a method in stdlib to see what the most efficient solution to a linear algebra problem was." (https://discourse.julialang.org/t/how-is-julia-written-in-julia/50918/9)

Having said that, the reader should note that Julia is not 100% protected from the syntax problems described above. The reason is that Julia supports a very powerful feature called metaprogramming, which is the technique of writing code that generates new code. I want to quote from the Julia documentation:

"The strongest legacy of Lisp in the Julia language is its metaprogramming support. Like Lisp, Julia represents its own code as a data structure of the language itself. Since code is represented by objects that can be created and manipulated from within the language, it is possible for a program to transform and generate its own code." (https://docs.julialang.org/en/v1/manual/metaprogramming/)

A function in programming takes in data and produces data as output.
Similarly, a metaprogram takes in code and produces code as output. Among other things, metaprogramming allows us to write domain-specific languages. So, for instance, you can write your own Julia packages with a totally new syntax. I will not dive deep into metaprogramming in this book, but you should keep in mind that it is a very powerful feature, and overusing it may introduce the syntax problems I described above.

In recent years, Julia has proved to be a great programming language for statistics, math, data science, machine learning, deep learning, and scientific computing, among many other things. If you have used R, Matlab, or Python before, you'll see that Julia's syntax is very much like theirs. It is really easy to learn Julia and to read code written in Julia. Another nice feature of Julia is its speed, which competes with that of C and C++, the benchmark languages when we talk about speed. But I'll come to that and explain it in detail soon.

If you have been in a data science or machine learning project before, you probably know that developing a model and deploying it are two different things, and it can be a real pain to accomplish both. Most of the time, data scientists develop a model using Python or R, and then data engineers or IT teams use another language to deploy it. This is because languages like Python, R, Matlab, or Mathematica are very flexible, easy to learn, and human readable, so development times are short with these languages. But their run times are slow, which makes them less than ideal for use in production. For example, for a bank with many credit card applications, it is probably not a good idea to use Python code to calculate application scores. When it comes to production, C, C++, or Java are great, but developing solutions or models with these languages can be very time consuming. That's why programs written in high-level languages are often refactored into low-level languages like C, C++, or C# for production. This issue is mostly referred to as the two-language problem in the Julia community. I will describe the two-language problem and how Julia solves it in more detail in the next section.
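To give a small taste of the metaprogramming feature mentioned earlier in this section, here is a minimal macro sketch. The macro name @logged is my own illustration, not from the book (Base's @show does something similar):

```julia
# A tiny macro: it receives the expression `ex` as data (an Expr)
# and returns new code that prints the expression and its value.
macro logged(ex)
    return quote
        val = $(esc(ex))
        println($(string(ex)), " = ", val)
        val
    end
end

@logged 2 + 3   # prints "2 + 3 = 5" and evaluates to 5
```

The point is that the macro manipulates code before it runs, which is exactly how packages can introduce their own syntax.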
Table 1.1 provides a high-level comparison of Julia with other dynamically typed languages and with low-level languages. With the flexibility of high-level languages and the speed of low-level languages, Julia seems to be the best option for data science, analytics, machine learning, and numerical computing.
Table 1.1 Comparison of Julia with other languages

Language   | Flexibility and Ease of Use | Speed | Package Management | Dependency to Low-Level Languages
-----------|-----------------------------|-------|--------------------|-----------------------------------
Julia      | ✔                           | ✔     | ✔                  | ✔
Python, R  | ✔                           | ✘     | ✔                  | ✘
C, C++     | ✘                           | ✔     | ✘                  |
1.1 Two-Language Problem

There are various reasons why I use Julia, and the first that comes to mind is that it addresses the two-language problem. To explain the two-language problem, I first have to explain the model development process. When you develop an analytical model, you first get the data and do some exploratory analysis and visualization to gain more insight into it. Then you prepare the data for modeling. This step includes handling missing values and outliers, feature engineering, etc. Next, you develop the model with the algorithms of your choice. After the models are trained, you decide what to do next based on model performance. If the results show high bias (underfitting), you return to the feature engineering step and try to find more relevant features or increase model complexity. If you observe high variance (overfitting), you gather more data or decrease model complexity. If you finally find a model with satisfactory performance for production, you hand it over to the IT or data engineering team to deploy it in production.
You can use microservices or containers to deploy models in Python, but this requires communication time between systems and also creates another point of failure. That's why, most of the time, the IT team has to refactor the code into languages like C, C#, or Java, which run faster in a production environment. That's called the two-language problem: the analytics team uses one language to develop models, and another language is used to deploy those models in production. The two-language problem usually costs extra time and effort. I've experienced this myself in my professional life. I've been in several analytics projects in which we developed models using Python or R and then handed the code over to the IT team. Every time, we spent time explaining the algorithm and the code to them. Then they worked on refactoring the code in their language of choice, mostly C# or Java. With Julia, we can save this time and go directly from development to production. Unlike other dynamic languages like Python or R, Julia is a modern language whose creators kept researchers and data scientists in mind. Julia's great advantage is that it can be integrated with production platforms with minimal effort.

Another type of "two-language problem" appears when the main program is written in a high-level, slow, interpreted language like Python or R and several parts or libraries are implemented in low-level languages like C or C++. As I've been explaining, Julia solves this problem too, because you won't need a low-level language when you work with Julia. Thus, we can say that Julia solves the two-language problem in both senses.

Figure 1.1 provides a mental model for the modeling lifecycle I've described. The first part is the responsibility of the analytics team and includes the steps from collecting data to selecting a champion model. The second part is the responsibility of the IT or data engineering team and covers running and maintaining a model in the production environment.

Figure 1.1 A high-level sketch of the modeling lifecycle. In the analytics part we need ease of syntax for the steps before training. For training and production, we need speed. Julia provides both and eliminates the refactoring step.

This mental model of the lifecycle covers the main steps of model development and production. We will often refer to these steps in subsequent chapters.
In this book we will follow the roadmap outlined in Figure 1.1. The second and third chapters cover programming concepts using Julia, and each subsequent chapter focuses on completing a separate project. The second section explores real-life cases for importing, analyzing, and visualizing data with Julia. Then we will develop machine learning models with different algorithms using Julia. One section is dedicated to supervised learning algorithms and another to unsupervised learning. We will then develop deep learning models. Finally, we will see how to deploy a model in production using Julia.

Additional Resources
You may refer to the following articles and posts about the two-language problem and Julia in production.

∙ JuliaCon2020: Julia is production ready, Bogumil Kaminski, 7 Aug 2020, https://bkamins.github.io/julialang/2020/08/07/production-ready.html
∙ Julia: A Fresh Approach to Technical Computing, Dr. Viral B. Shah, 27 Aug 2020, https://youtu.be/tUWZ6XhC2K4
∙ High-level case studies of Julia in production: https://juliacomputing.com/case-studies/
1.2 Julia is Fast

The next reason I like Julia is its speed. In this section I will first explain why Julia's speed matters, and then I will cover benchmark studies which show that Julia really is fast.
1.2.1 Why does it matter?

Julia is well-suited for numerical computing, data science, machine learning, and artificial intelligence. Python and R are also very useful for these purposes, and they provide very nice front-end user interfaces. But they are slow. I have mentioned that run-time speed should be high in production, but it is just as important during development. You need to run the algorithms or models fast to keep the development phase short. The development phase mainly consists of three steps:

1. Have an idea for an algorithm or a machine learning model, design the algorithm, and code it in your favorite language (Figure 1.1 Modeling).
2. Run the code, then evaluate and fine-tune the results (Figure 1.1 Training, Validation and Tuning).
3. Test the results and select the champion model (Figure 1.1 Testing, Model Selection).

If you use a language like Python or R, modeling is easy, and you can code your idea in a short time. But unless you use a library implemented in a low-level language, you will wait a long time to see the results of training. (And in that case, you won't be able to inspect what is going on in the C++ code unless you are a C++ expert.) If your code runs for, say, a week, this means you have to wait a week to see whether your idea works. Then you'll analyze the results, fine-tune your model, and run the code again to see the new results. It is true that many libraries in Python and R are implemented in C/C++, but you will still rely on slower libraries for data ingestion, analysis, manipulation, etc. So it will take a long time to reach the best results. Conversely, if you choose a language like C, C++, or Java, you will spend much more time on modeling. In addition, your code will most probably not be very flexible, and it will be a pain to modify it whenever you want to change the architecture. Steps like data analysis, visualization, and feature engineering won't be easy either.

How I started to use Julia
Now comes the story of how I decided to move my work to Julia. When I was working on my PhD thesis, I wrote my code in C++. The code was more than 2,000 lines, and it took me almost a month to write. I was using MATLAB to visualize the results. Moreover, whenever I wanted to change something in the model, it was difficult to modify. On the plus side, the run time was just a few minutes. A few years ago, I wanted to make some modifications to the model for a new experiment. My aim was to give the model more flexibility so I could try different setups in a short time. First, I tried refactoring the code in R. With R, I could change the architecture very quickly, but this time the run time was very long. Once I changed the setup, I had to wait for hours to see the results. Next, I decided to give Julia a try and rewrote the model in Julia; the run time was the same as the C++ implementation. Finally, I had both the flexibility and the speed.

A common solution to improve the run-time speed of dynamic programming languages is to write the inner kernel in a low-level language like C or C++ with a Python or R interface. Most of the time this won't be a problem if you are using common packages. But if you want to write your own packages or functions, you will need C or C++ to improve run time. Julia aims to solve the run-time problem by combining ease of use and speed. Once you start programming in Julia, you won't have to learn another language, no matter how far you go. Julia is a high-level, dynamically typed language, but its speed is comparable to C's.
1.2.2 Is Julia Really Fast?
There are various benchmark studies that compare the speed of Julia with other languages. The first example I want to mention is from the official web page of the Julia language (https://julialang.org/benchmarks/). In this experiment, eight algorithms were run in 12 different languages. Julia was one of the best performers, together with C, Lua, and Rust. The remarkable thing is that the worst performers in the study were the languages used most by data science and machine learning practitioners: Matlab, Mathematica, Python, R, and Octave, in decreasing order of performance. In the book Julia High Performance, Avik Sengupta compares the run-time performance of 10 programming languages against C in calculating the Mandelbrot set. Again, Julia is one of the best performers, together with Fortran, JavaScript, and Lua, while the worst performers are the languages of numerical computing and data science: Mathematica, Matlab, Python, and R, in decreasing order of performance. Another example compares run times for adding 10 million random numbers (https://github.com/JuliaAcademy/JuliaTutorials/blob/main/introductory-tutorials/intro-to-julia/09.%20Julia%20is%20fast.ipynb). I ran this code on my computer and got the results in the following table. The results show that Julia can compete with even C in terms of speed. Python NumPy is relatively fast, but it uses C for speed.

Table 1.2 Run times for adding 10 million random numbers with different programming languages
Program                 | Mean Duration (ms)
------------------------|-------------------
Julia hand-written SIMD | 3.0
Julia built-in          | 3.0
C -ffast-math           | 5.1
Python NumPy            | 8.0
Julia hand-written      | 8.9
C                       | 9.1
Python built-in         | 536.9
Python hand-written     | 760.5
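For reference, the "hand-written" rows in Table 1.2 correspond to a plain loop like the sketch below. The function name mysum is my own, and exact timings depend on your machine:

```julia
# Naive hand-written sum over an array, with no SIMD annotations.
function mysum(a)
    s = zero(eltype(a))   # accumulator of the element type
    for x in a
        s += x
    end
    return s
end

x = rand(10_000_000)
@time mysum(x)   # hand-written loop
@time sum(x)     # built-in sum, itself written in Julia
```

Note that both versions are pure Julia; the benchmark differences come from compiler optimizations such as SIMD, not from dropping down to C.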
The last example is from real-life practice. After C, C++, and Fortran, Julia is the only modern language to have achieved a peak performance faster than one petaflop (https://juliacomputing.com/case-studies/celeste/). This performance was achieved in the Celeste project, which took all the telescope data in the world and produced a single catalog of stars and galaxies.

Additional Resources
You may refer to the following articles for detailed information on benchmark studies that compare the speed of Julia with other languages.

∙ Julia Micro-Benchmarks, https://julialang.org/benchmarks/
∙ Database-like ops benchmark, https://h2oai.github.io/db-benchmark/
∙ Benchmark of popular graph/network packages v2, Timothy Lin, https://www.timlrx.com/blog/benchmark-of-popular-graph-network-packages-v2
∙ Query times for reading csv files with different data types: https://www.queryverse.org/benchmarks/
So, what do these examples tell us? If you are a practitioner in analytics, numerical computing, machine learning, or data science and want to speed up your code, you can choose a language like C, but it won't be ideal for your routine analysis and modeling tasks. If you stay with a language like Python or R, on the other hand, you will wait a long time for your models to train. Julia provides both the simplicity of Python and R and the run-time speed of C and C++. Julia is easy to write and learn, similar to other dynamic languages like Python, R, or MATLAB. But unlike those languages, performance is at the core of Julia. Hence, you can write high-performance programs and libraries in native Julia.
1.2.3 The engine behind the speed of Julia

The main reason behind the speed of Julia is that it is a just-in-time (JIT) compiled language, not an interpreted one. This means that Julia code is converted to machine code for execution on the CPU at run time. Interpreted languages, by contrast, have a runtime that executes the source code directly. Julia's compilation infrastructure is built on top of LLVM. The LLVM project aims to provide a compilation strategy for both static and dynamic compilation of programming languages, and it is the engine that gives Julia its speed. LLVM has been used to implement compilers for other languages like C, C++, Fortran, and Rust, besides Julia. The details of the LLVM project are beyond the scope of this book, so I will not go into them, but if you like you can visit www.llvm.org for more information. LLVM provides the basic infrastructure to produce machine code that runs fast. But it is not just LLVM that provides the speed: Julia itself is designed for high speed, and the design of types in Julia plays a very important role in it. Julia works with the LLVM JIT compiler framework as follows. When the code is run for the first time, Julia code is parsed and types are inferred. Then the JIT compiler generates the LLVM code. Next, this LLVM code is optimized and compiled to native code. The next time you run the code, the native code is called. For that reason, running Julia code for the first time takes a bit longer than subsequent runs. The following code gives the time taken to sum 10 million random numbers. The second run is almost eight times faster than the first.
julia> x = rand(10_000_000);

julia> @time sum(x)
  0.026387 seconds
5.000725567830743e6

julia> @time sum(x)
  0.003380 seconds
5.000725567830743e6
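Because compilation happens at run time, you can inspect each stage of this pipeline yourself from the REPL. A sketch, using an arbitrary one-line function of my own:

```julia
f(x) = 2x + 1

@time f(3)          # first call triggers JIT compilation
@time f(3)          # second call reuses the cached native code

# InteractiveUtils is loaded automatically in the REPL;
# in a script you need `using InteractiveUtils` for these macros.
using InteractiveUtils
@code_lowered f(3)  # lowered Julia IR after parsing
@code_llvm f(3)     # optimized LLVM IR
@code_native f(3)   # final machine code for your CPU
```

Reading the @code_llvm output for small functions is a nice way to convince yourself that Julia really does compile down to tight machine code.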
The good news is that with newer versions of Julia, this difference between the first and second runs is becoming smaller. At the time of writing, the stable release of Julia was v1.8.5, and v1.9.0-beta4 was also available. I ran the following code in both versions. Don't worry if you don't understand it; it simply plots the sine and cosine functions.

julia> using Plots
julia> x = range(0, 10, length=1000);
julia> y1 = sin.(x);
julia> y2 = cos.(x);

The following lines show the time to first plot in both versions (first in 1.8.5, then in 1.9.0-beta4). As seen below, the difference is very promising in favor of the upcoming versions.

julia> @time plot(x, [y1 y2], label=["sin(x)" "cos(x)"], linewidth=3)
  1.161421 seconds

julia> @time plot(x, [y1 y2], label=["sin(x)" "cos(x)"], linewidth=3)
  0.101233 seconds
1.3 Package Ecosystem

In the modern world, the package ecosystem is also important for a programming language. All programming languages come with standard capabilities, but you may not find all of the functionality you want when you first install a language. For example, if you want to draw quality plots, you will not find that capability in the standard library, so you will either develop it yourself from scratch or find a library that provides it. Fortunately, there are libraries built for special purposes. In Python there is matplotlib, in R there is ggplot, and in Julia there is Plots, which provides advanced plotting capabilities.
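When a capability like plotting is not in the standard library, you add it through Julia's built-in package manager, Pkg. A minimal sketch (Pkg.add downloads from the General registry, so it needs a network connection):

```julia
using Pkg

Pkg.add("Plots")   # install the Plots package from the General registry
Pkg.status()       # list the packages in the active environment

using Plots        # after installation, load it like any other library
```

In the REPL you can also press ] to enter the interactive Pkg mode and type `add Plots` directly.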
A decade ago, people doing research on machine learning and deep learning spent a lot of time developing models in C, C++, or Matlab. In the last ten years, however, machine learning and deep learning libraries have been developed. For example, TensorFlow, PyTorch, and Scikit-Learn are the most popular deep learning and machine learning libraries for Python. Thanks to these libraries, people can now develop complex models with much less effort. When I talk about Julia, one of the arguments I hear is that its package ecosystem is weak. That might have been true a few years ago, but in recent years the number of registered Julia packages has been increasing at an accelerating rate. Python and R have very rich package ecosystems. For example, R currently has nearly 19 thousand packages available, but this number was less than 6 thousand in early 2015. Julia's package ecosystem is also developing very fast. As of the time of writing this book, the number of registered Julia packages was about 7,800. This is not bad at all if we consider that version 1.0 was released just in 2018 and the number of packages was less than 2,500 in early 2019.

Additional Resources
Besides many others, I find the following articles and posts helpful on the reasons to learn Julia.

∙ The Rise of Julia – Is it Worth Learning in 2022?, Bekhruz Tuychiev, 19 May 2022, https://www.datacamp.com/blog/the-rise-of-julia-is-it-worth-learning-in-2022
∙ Bye-bye Python. Hello Julia! As Python's lifetime grinds to a halt, a hot new competitor is emerging, Ari Joury, 2 May 2020, https://towardsdatascience.com/bye-bye-python-hello-julia-9230bff0df62
∙ 5 Ways Julia is Better Than Python. Why Julia is better than Python for DS/ML?, Emmett Boudreau, 20 Jan 2020, https://towardsdatascience.com/5-ways-julia-is-better-than-python-334cc66d64ae
∙ Why you should learn Julia, as a beginner / first time programmer, Logan Kilpatrick, 10 Dec 2021, https://blog.devgenius.io/why-you-should-learn-julia-as-a-beginner-first-time-programmer-96e0ad33faba
∙ Why You Should Invest in Julia Now, as a Data Scientist? Know what Julia has to offer and the resources to get started., Logan Kilpatrick, 7 Dec 2021, https://betterprogramming.pub/why-you-should-invest-in-julia-now-as-a-data-scientist-30dc346d62e4
∙ Inspiring stories of people who started to use Julia: https://julialang.org/blog/2022/02/10years/
1.4 Summary

∙ A model lifecycle includes two main phases: development and production. Traditionally, a high-level programming language like Python, R, or Matlab is used in the development phase for its ease of use, while a low-level programming language like C++, C#, or Java is used in production for its speed.
∙ Deployment of a model in production generally requires extra time and effort for refactoring the model.
∙ A similar situation holds during development: the data import, data analysis, visualization, feature engineering, and modeling steps require ease of use, while the training step requires speed.
∙ Julia is the only language that provides ease of use and speed together.
∙ Various benchmark studies show that the speed of Julia competes with that of C. Indeed, Julia is the fourth language (and the first modern language) to exceed one petaflop at peak performance.
∙ Although it is a young programming language, Julia's package ecosystem is growing at an accelerating rate.
2 Julia Programming: Data Types and Structures

This chapter covers

Defining and using variables
Using operators and basic operations
Understanding data types and the type hierarchy
Storing and retrieving data with data structures

You may already know how to program in Julia, or you may be coming from another programming language. If you are in the first group, you may skip this chapter and the next and refer back to them whenever you need. If you are in the latter group, you may use these chapters as a quick start to Julia. If you haven't coded in Julia before, I suggest going through the appendix for an installation guide.
2.1 Variables, Data Types and Operations

Types are at the very core of the Julia programming language. Julia is a dynamically typed language like Python and R. This means you don't have to specify the type of a variable when you define it. On the other hand, there are times when specifying data types in advance increases performance. For example, when reading large amounts of data, specifying the data types of the columns in advance may improve performance. Or we can specify the data type as Float32 instead of Float64 to improve performance when dealing with numerical operations that involve large arrays of floating-point numbers. We can do the same when working with GPUs, as they are designed to work with specific data types. Thus, we can say that Julia is an optionally typed language. A novel feature of Julia's types is that we can define our own types. In addition, we can also parameterize types. These are possible with the help of the type hierarchy in Julia, which will be explained soon.
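To see what the Float32 versus Float64 trade-off mentioned above means in memory terms, here is a small sketch:

```julia
a = rand(Float64, 1_000)   # 8 bytes per element
b = rand(Float32, 1_000)   # 4 bytes per element: half the memory

sizeof(a)   # 8000
sizeof(b)   # 4000
```

For large arrays, halving the memory footprint also halves the memory traffic, which is often what makes Float32 faster in practice.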
2.1.1 Variables

Declaring a variable is no different than in any other dynamically typed language. You don't have to specify the data type. Just type something like x = 44, and Julia will infer that x is a variable with the value 44 and type integer. You can later assign the variable a value of another type. If you now type x = "Julia", the data type of x will automatically change to string.

p = 3.14
# 3.14
typeof(p)
# Float64
p = "some text"
# "some text"
typeof(p)
# String
COMMENTS

Every programming language uses comments to write explanatory notes about the code. In Julia, the comment sign is #. Code after # on the same line is ignored. For multiline comments, put #= at the beginning of the first line and =# at the end of the last line.

# This is a one-line comment.
#=
This is a
multi-line comment
=#

In this book, the outputs are shown as comments so you can copy and paste the code into the editor and run it without manual editing.

It is possible to declare multiple variables at once.

x, y, z = 10, 15, 20;    #A
x = 10; y = 15; z = 20;  #B
x, y = y, z;             #C
We can also use Unicode characters as variable or function names. To type a Unicode character, type a backslash (\) followed by the name of the character and press TAB. You can see the full list of available Unicode characters in the Julia documentation (https://docs.julialang.org/en/v1/manual/unicode-input/).

α = 6;  #A
β = 7;  #B
α * β   #C
# 42
We can also use subscripts and superscripts in variable names. For a subscript, type \_ and for a superscript type \^, followed by the sub- or superscript character, and then press TAB. You should be familiar with this if you use LaTeX.

a² = 15;  #A
β₀ = 12;  #B
Variable names are case sensitive, and they may start with a lowercase or uppercase letter, a Unicode character, or an underscore (_). They may include numeric characters anywhere except as the first character. Keywords like if, else, begin, or end cannot be used as variable names. The list of keywords can be found in the Julia documentation (https://docs.julialang.org/en/v1/base/base/#Keywords).
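A quick sketch of these rules; the commented-out lines illustrate names the parser rejects (the specific names here are made up for illustration):

```julia
_total = 10      # starting with an underscore is valid
MyVar2 = 20      # mixed case and trailing digits are valid
δ = 0.5          # a Unicode character is a valid name
# 2fast = 1      # ERROR: a name cannot start with a numeric character
# end = 3        # ERROR: keywords cannot be used as variable names
_total + MyVar2
# 30
```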
2.1.2 Type Hierarchy in Julia

Data types are tags on values that restrict the range of potential values that can be stored at that location. In Julia, it is very important to understand how types work. Types are not defined as a flat collection; they form a structured hierarchy. Types have subtypes and supertypes in this hierarchy. At the top, there is the type Any. Below Any there is a long pyramid of type hierarchies. Except for Any, every type is a subtype of another one, but not all types have subtypes.
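We can explore this hierarchy interactively with the built-in supertype() and subtypes() functions, and test membership with the <: (subtype) operator. A minimal sketch:

```julia
using InteractiveUtils   # provides subtypes() outside the REPL

supertype(Int64)      # Signed
supertype(Signed)     # Integer
subtypes(Integer)     # Bool, Signed, Unsigned
Int64 <: Integer      # true: Int64 is a subtype of Integer
# true
Integer <: Any        # true: every type is a subtype of Any
# true
```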
Figure 2.1 Type Hierarchy in Julia. Types have supertypes and subtypes. At the top there is the type Any. Abstract types have only subtypes. They can’t be instantiated. We can create variables with only concrete types.
In Figure 2.1 you see that some types are represented in dashed boxes and some in solid ones. The ones in dashed boxes are called abstract types. Those cannot be instantiated, which means we cannot create variables of an abstract data type. They are nodes in the type hierarchy which classify the relations and group similar types together. Abstract types are important because they form the skeleton of the type hierarchy. Besides other uses, abstract types allow us to define functions with specified types. For example, we may want to define a function which uses only real or complex numbers. In this case, we can limit the argument type with Number, which is the supertype of both. The other types, which can be instantiated and store values, are called concrete types, and we can create variables with these types. We can define our own abstract types using the following syntax:

abstract type MyType end

A conditional statement runs a block of code only if a given logical expression is true. For example, assume score holds an exam score:

score = 75
if score > 60          #A
    println("passed")  #B
end                    #C
Obviously, nothing happens if score is less than 60. In most cases, we may want to get alternative results depending on the truth of the logical expression. In that case, we should use the else keyword to tell what will happen if the condition is false. Now let's tell the program what to do if the logical expression is not true.

if score > 60                               #A
    println("Congratulations, you passed!") #A
else                                        #B
    println("Sorry. You failed.")           #B
end
Now let's go one step further. Instead of just saying passed or failed, we want to get the letter grade depending on the score. If the score is greater than or equal to 85 (score ≥ 85), the grade is A. If that is not true but the score is greater than or equal to 70 (70 ≤ score < 85), the grade is B, and so on. Conditions like these can be chained with the elseif keyword.

The logical operators && and || evaluate their right-hand side only when necessary. This short-circuit behavior can be used as a compact conditional.

age = 25
(age > 0) && (println("Your age is $age."))        #A
(age < 0) || (println("Enter a positive number"))  #B
age = "twenty"
notInteger = !(typeof(age) <: Integer)

We can also iterate over the key-value pairs of a dictionary in a for loop.

d = Dict("a" => 10, "b" => 20, "c" => 30)
for (k, v) in d
    println("Key: $k, Value: $v")
end
# Key: c, Value: 30
# Key: b, Value: 20
# Key: a, Value: 10
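A sketch of the letter-grade logic described above; the thresholds below B are assumptions for illustration:

```julia
score = 78
if score >= 85
    grade = "A"
elseif score >= 70
    grade = "B"
elseif score >= 55      # assumed threshold for C
    grade = "C"
else
    grade = "F"
end
grade
# "B"
```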
3.4.2 While Loops

The next type of loop is the while loop. While loops are used to repeat a task as long as a condition holds. At each iteration the condition is checked. If the condition holds, the given tasks are executed. Then the state is updated. This is something different than the for loop: if we don't update the state, the initial conditions never change and the condition holds forever. This causes an infinite loop. While loops are also used when we don't know in advance how long the process will continue. For example, assume you ask for an input from the user and design the code such that the program continues to ask for a new input until a specified input is entered by the user.
The general syntax for while loops in Julia is like this:

# set the initial state
while condition
    # do some task
    # update the state
end
Let's start with a very simple example. I want to calculate the squares of positive integers less than five.

i = 1             #A
while i < 5       #B
    println(i^2)  #C
    i += 1        #D
end
# 1
# 4
# 9
# 16
Another example:

arr = [3, 5, 7, 9, 11, 13, 15];
while !isempty(arr)
    print(pop!(arr), " ")
end
# 15 13 11 9 7 5 3
And the example I've mentioned at the beginning, which expects an input from the user:

begin
    input = nothing
    arr = []
    while input != 0
        print("Enter a number (0 to exit): ")
        global input = parse(Int, readline())
        append!(arr, input)
    end
    println("Sum of the numbers you entered: ", sum(arr))
end
The global keyword denotes that the input variable inside the while loop is the same variable as the one outside the loop. If we don't write anything here, there will be an ambiguity about whether the two input variables are the same or different. If we write local instead of global, the two variables will be different: one in the global scope, outside the while loop, and one in the local scope, inside the while loop. In that case you will get an error, because the input variable inside the while loop is never initialized. That's why it is good practice to denote global and local variables clearly.

Figure 3.4 Flow of a while loop. First set the initial state. Continue to do the same task until a condition on the state is met. Don't forget to update the state at each iteration or you will have an infinite loop.
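A minimal sketch of this scoping rule in a top-level script (inside a function, no global annotation would be needed):

```julia
counter = 0
while counter < 3
    # without `global`, this assignment errors in a script:
    # counter belongs to the global scope, not the loop's local scope
    global counter += 1
end
counter
# 3
```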
3.4.3 Break and Continue

In some cases, during the regular execution of the loop, we may want to stop looping if a specific condition is met. Or we may want to skip one step and continue looping for some specific conditions. Assume we want to print the elements of an array, but we want to stop looping if we get a specific number, say 999. We don't know in advance whether we will meet this number, so we should put an if statement in the loop and ask the loop to end if 999 comes.

arr = [3, 5, 7, 9, 42, 999, 11, 13, 15, 999, 44];
for el ∈ arr
    if el == 999
        println("Break condition is met!")
        break
    end
    println(el)
end
# 3
# 5
# 7
# 9
# 42
# Break condition is met!
Following is the while loop which does the same job.

i = 1
while i <= length(arr)
    if arr[i] == 999
        println("Break condition is met!")
        break
    end
    println(arr[i])
    i += 1
end

Functions without a name are called anonymous functions. They are usually defined with the arrow syntax:

x -> 3x^2 + 4x - 5
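The continue keyword mentioned above skips the rest of the current iteration instead of ending the loop. A small sketch:

```julia
for i in 1:10
    if i % 2 == 0
        continue    # skip even numbers and move to the next iteration
    end
    print(i, " ")
end
# 1 3 5 7 9
```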
There are various use cases for anonymous functions, and we will use them frequently in the following chapters. For now, have a look at the following code. The map() function is used to apply an anonymous function to the elements of an array.

numbers = [3 4 7 8 9 12];
newNumbers = map(x -> 3x^2 - 2x, numbers)
# 1×6 Matrix{Int64}:
#  21  40  133  176  225  408
In Julia, functions take tuples as input and return tuples as output. Let's see another function which finds the maximum number in an array.

function mymax(array)
    maxnum = typemin(Int64)  #A
    for num in array
        if num > maxnum
            maxnum = num
        end
    end
    return maxnum
end
Normally, a function returns the value of the last evaluated expression. Thus, if the value of the last expression in the function is the value you want, you don't have to use the return keyword. Nevertheless, it is good practice to use return whenever possible in long-form functions, to prevent confusion. (For an interesting discussion about the usage of return in Julia, see https://groups.google.com/g/julia-users/c/4RVR8qQDrUg/m/TCSuWgk9BAAJ.) In some cases, you may want to return different values based on a condition.

function absDiff(a, b)
    if a > b
        return a - b
    else
        return b - a
    end
end
We can also write the above code as follows:

function absDiff(a, b)
    return if a > b
        a - b
    else
        b - a
    end
end
A function may have no arguments at all.

function greet()
    println("Welcome to Julia Programming")
    println("I hope you enjoy...")
end
greet()
We can also use Unicode characters as function names.

ϕ(x, y) = x^y  #A
ϕ(3, 4)
# 81
We can assign a function to another name, just like variables.

phi = ϕ
phi(3, 4)
# 81
We can specify the types of the arguments a function can take and also the type of the return value. The most important reason for declaring argument types is multiple dispatch. Multiple dispatch is one of the most powerful features of Julia and will be discussed in the next section.

function ratio(x::Int64, y::Int64)::Rational
    return x//y
end
Julia functions may return multiple values. If a function returns multiple values, they are returned as a tuple. The following function returns the mean and standard deviation of the values in an array.

function find_mean_sd(arr)
    mean = sum(x for x in arr) / length(arr)
    std = √(sum((x - mean)^2 for x in arr) / (length(arr) - 1))
    return mean, std
end

array = [4, 5, 6, 8, 12, 34, 65, 98, 76, 36, 35];
μ, σ = find_mean_sd(array)    #A
result = find_mean_sd(array)  #A
We can assign default values for some or all of the arguments of a function. In this way, if the user of the function doesn't enter a value for the argument, the default value is used. The following function returns the square of the number if the user provides one argument.

pow(x, y=2) = x^y
pow(9)
# 81

Operators are functions:

In Julia, most of the operators are also functions and can be used just like functions. You can use +(x, y) instead of x + y, or *(a, b, c) instead of a*b*c.
3.5.2 Variable Number of Arguments

Sometimes we may not know in advance the number of arguments a function will take. In that case, it is better to write a function which can take an arbitrary number of arguments. These are called variable-number-of-arguments functions, or shortly, varargs functions. We can define a varargs function using the ellipsis (...) operator.

function printall(x...)
    println(x)
end

printall("Julia")                 #A
printall("Julia", "Python", "R")  #A
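The ... operator also works in the other direction: when calling a function, it splats a collection into individual arguments. A small sketch combining both (the function name mysum is made up for illustration):

```julia
# slurping: nums collects any number of arguments into a tuple
function mysum(nums...)
    total = 0
    for n in nums
        total += n
    end
    return total
end

mysum(1, 2, 3)
# 6

vals = [4, 5, 6];
mysum(vals...)   # splatting: equivalent to mysum(4, 5, 6)
# 15
```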
3.5.3 Keyword Arguments

Up to now we have only used positional arguments. This means, when calling a function, we should pass the arguments in the same order they are defined in the function. But sometimes we may have to use functions with many arguments. Functions that plot graphs are good examples of this. In a plot function there may be many arguments like line shape, line width, line color, fill color, etc. Trying to use them in the defined order would be very difficult.

position_args(x, y=10, z=20) = println("x=$x, y=$y, z=$z")
position_args(1)       #A
# x=1, y=10, z=20
position_args(1, 2)
# x=1, y=2, z=20
position_args(1, z=3)
# ERROR: MethodError: no method matching position_args(::Int64; z=3)
Besides positional arguments, there are also keyword arguments in Julia. Keyword arguments are separated from positional arguments with a semicolon. Now let's define the same function with keyword arguments.

keyword_args(x; y, z=20) = println("x = $x, y = $y, z = $z")
keyword_args(1, z=3, y=2)  #A
# x = 1, y = 2, z = 3

Keyword arguments may or may not have default values, like positional arguments. If your function has only keyword arguments, then start with a ; inside the parentheses.

keyword_args(; x, y) = println("x = $x and y = $y")
3.5.4 Broadcasting and Dot Syntax

Broadcasting is performing elementwise operations on arrays of different sizes. It is used a lot in data science and machine learning applications. To demonstrate how it works, assume we have a vector of length five and a scalar number. If we try to add the vector and the scalar, we will get an error because they have different sizes.

v = Vector(1:5);
n = 7;
v + n
# ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)
In fact, what we are trying to do is add the scalar value to all of the elements of the vector. One way of doing this is replicating the scalar to match the size of the vector.

v + repeat([n], size(v, 1))
# 5-element Vector{Int64}:
#   8
#   9
#  10
#  11
#  12
Doing this every time is not efficient. In Julia, there is the broadcast() function which does this automatically.

broadcast(+, v, n)
# 5-element Vector{Int64}:
#   8
#   9
#  10
#  11
#  12
There is an even simpler method. Julia provides a dot operator for broadcasting. If we put a dot before an operator, the operation is broadcast and applied elementwise. Thus, v .+ n gives the same result as above.

a = [4 7 2 9 11 15];
b = [3 7 3 9 12 15];
a == b                #A
# false               #A
a .== b               #B
# 1×6 BitMatrix:      #B
#  0  1  0  1  0  1   #B
Whenever we want to make an elementwise calculation, we should use broadcasting. Assume we want to apply the function x^2 + 3x - 5 to all elements of an array. We should put a dot before every operator.

x = [3 5 7 9]
x.^2 .+ 3x .- 5
# 1×4 Matrix{Int64}:
#  13  35  65  103
If you don’t want to put a dot at every operation, you can use the @. macro. In this case it is enough to put @. at the beginning. The following code does the same thing as the previous one. @. x^2 + 3x - 5
If we want to use dot syntax with a function call, then the dot should come after the function name and before the parentheses, like exp.(a) or sqrt.(b).

f(x) = 3x^2 + 2x + 5
f.(a)
# 1×6 Matrix{Int64}:
#  61  166  21  266  390  710
We can use broadcasting to filter the elements of an array conditionally.

vec = [7 22 12 13 16 21 18 76];
vec[vec .> 20]   #A
# 3-element Vector{Int64}:
#  22
#  21
#  76
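Broadcast comparisons can also be combined elementwise, here with the elementwise & operator, to filter on several conditions at once. A small sketch using the same vector:

```julia
vec = [7 22 12 13 16 21 18 76];
vec[(vec .> 10) .& (vec .< 20)]   # keep elements strictly between 10 and 20
# 4-element Vector{Int64}:
#  12
#  13
#  16
#  18
```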
3.5.5 Composite Functions

In data science and machine learning, most of the time you have to use multiple functions chained together. If we have three functions f, g, and h, we can chain them using the f(g(h(x))) syntax. But this syntax gets harder to write and less readable as the number of functions increases. A simpler way of chaining functions in math is the ∘ sign (\circ TAB). We can chain the functions as (f ∘ g ∘ h)(x). We can use the same notation in Julia. Assume we have an array of 20 numbers and we want to find the sum of the squares of these numbers. We can do it with nested function calls.

x = rand(-10:10, 20);
square(x) = x .^ 2;
sum(square(x))
# 643
Julia also provides the ∘ operator to chain the functions.

(sum ∘ square)(x)
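Note that composition applies right to left: in f ∘ g, the function g runs first. A minimal sketch with two made-up helper functions:

```julia
inc(x) = x + 1
double(x) = 2x

(double ∘ inc)(5)   # inc runs first: double(inc(5)) = double(6)
# 12
(inc ∘ double)(5)   # double runs first: inc(double(5)) = inc(10)
# 11
```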
But things get even simpler with the help of the pipe operator. The pipe operator takes an operand and sends it to the function on the right as its argument. In Julia the pipe operator is |>. Using the pipe operator, we can write x |> f instead of f(x). This may not seem necessary when we have one function, but it comes in very handy as the number of functions increases. We can write the above chained function as

x |> square |> sum
Assume we want to find the lengths of the words in a string.

str = "Writing functions in Julia Programming"
length.(split(str))      #A
str |> split .|> length  #B
Please spend some time understanding the following line of code. It calculates the root mean square error of an array. The mathematical formula for this is:

rmse = √( Σᵢ (xᵢ - x̄)² / n )

where n is the number of elements in the array and x̄ is the mean of the array. We can calculate rmse without the pipe operator by writing

sqrt(sum((x .- sum(x)/length(x)).^2) / length(x))

This method has many parentheses and may be confusing. Instead, we can use the following syntax with the pipe operator, which is slightly longer but more readable. I suggest getting used to the pipe operator; you will see that things get easier in the future.

(x .- sum(x)/length(x)).^2 ./ length(x) |> sum |> sqrt
There is one thing worth mentioning about the pipe operator in Julia. If the function you use takes just one argument, there will not be a problem. But if the number of arguments is more than one, then you may need to use anonymous functions with the pipe operator. In the following code, the pipe operator puts the value 4 in place of a in the function.

f(x, y) = x^y - x*y + 5x - 3y
4 |> a -> f(3, a)
# 72
3.5.6 Mutating Functions

Recall that I've previously mentioned that Julia has pass-by-reference behavior, which means arguments can be changed from inside the function and the changes are visible to the calling code. In other words, Julia functions can modify their arguments. In Julia, by convention, functions which modify their arguments are marked with an exclamation mark at the end of the function name. Let's look at the built-in sort function. It has two versions: mutating and non-mutating.

x = [35, 1, -7, 12, -11, -17];
sort(x);
x'
# 1×6 adjoint(::Vector{Int64}) with eltype Int64:
#  35  1  -7  12  -11  -17

The vector x didn't change after using the sort function. x' gives the transpose of x; I used it to fit the vector on one line. Now let's use the mutating version of sort. This time the sort!() function will mutate the vector x.

sort!(x);
x'
# 1×6 adjoint(::Vector{Int64}) with eltype Int64:
#  -17  -11  -7  1  12  35
Now let's write a mutating function ourselves. When dealing with convolutional neural networks, it is common to pad arrays or matrices with zeros. Let's pad a vector with preceding and succeeding zeros.

function padwithzero(vec, n)
    x = vcat(zeros(n), vec, zeros(n))
    return x
end

x = [35, 1, -7, 12, -11, -17];
padwithzero(x, 2)'
# 1×10 adjoint(::Vector{Float64}) with eltype Float64:
#  0.0  0.0  35.0  1.0  -7.0  12.0  -11.0  -17.0  0.0  0.0
x'
# 1×6 adjoint(::Vector{Int64}) with eltype Int64:
#  35  1  -7  12  -11  -17

The original vector didn't change because the function returns another vector. Now let's see the mutating version of the function.

function padwithzero!(vec, n)
    for i in 1:n
        insert!(vec, 1, 0)
    end
    for i in 1:n
        append!(vec, 0)
    end
end

padwithzero!(x, 2)
x'
# 1×10 adjoint(::Vector{Int64}) with eltype Int64:
#  0  0  35  1  -7  12  -11  -17  0  0
3.6 Methods

In most object-oriented languages, objects have methods, but in Julia, functions have methods. This is not extraordinary if we consider that functions are objects in Julia. A function is an object that maps a tuple of arguments to a return value, which is also a tuple. Remember that types are at the heart of everything in Julia. Up to now we have mainly defined functions with one method, which means the function accepts arguments of one specific type. In Julia, however, functions may provide different implementations for different types or counts of arguments. We can define the behavior of a function depending on the combination of argument types and counts. A method is the definition of one possible behavior for a function depending on the types of its arguments. For example, we can define a function which accepts two integer arguments and returns their sum. We can also define the same function so that it takes two floating-point numbers and returns their difference. We can define another method for the same function which takes a string and returns the number of characters in it. And this may go on. So, we have defined three methods for the same function (see Figure 3.6). This is an exaggerated example to give an intuition, but it is good practice to design the methods packed in a function such that they work in a consistent way. Choosing which method to apply for a function call is called "dispatch". In traditional object-oriented languages, the method is chosen based on the type of the first argument. This is called single dispatch. In Julia, the method is chosen based on the order, types, and counts of all arguments, which is called multiple dispatch. Multiple dispatch is one of the most important features of Julia.

Figure 3.6 (a) Single dispatch: one function may have only one method. (b) Multiple dispatch: the method to be applied is decided based on the type, number, and order of the arguments.
3.6.1 Multiple Dispatch

In Julia, all built-in functions use multiple dispatch. We can see the list of all methods defined for a function using the methods() function.

*
# * (generic function with 320 methods)
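As a small sketch of defining several methods for one function; the function name describe and its behaviors here are illustrative assumptions, not from a particular library:

```julia
describe(x::Int64)   = "an integer: $x"
describe(x::Float64) = "a float: $x"
describe(x::String)  = "a string with $(length(x)) characters"

describe(7)        # dispatches on Int64
# "an integer: 7"
describe("Julia")  # dispatches on String
# "a string with 5 characters"
```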
methods(*)
# [1] *(x::T, y::T) where T ...

# "i"       => 908
# "you"     => 860
# "the"     => 632
# ⋮
# "adore"   => 1
# "ignited" => 1
# "wild"    => 1

df = DataFrame(singer = String[], word = String[], count = Int64[])
addToDF!(df; artist=singer, wordVec=wordDict)
df
# 1486×3 DataFrame
#   Row │ singer  word     count
#       │ String  String   Int64
# ──────┼────────────────────────────
#     1 │ adele   i          908
#     2 │ adele   you        860
#     3 │ adele   the        632
#   ⋮   │   ⋮        ⋮        ⋮
#  1484 │ adele   adore        1
#  1485 │ adele   ignited      1
#  1486 │ adele   wild         1
#              1480 rows omitted
The last function, readFileToDF!(), uses these four functions to get the name of the artist and the word counts in a file. In our dataset we have 49 files, each standing for a different artist. We will first try our code on a single file to test how it works. Then we will read data from all files and put everything together.

file = "adele.txt"
folder = "Data/chapter04/SongLyrics"
songLyrics = DataFrame(singer = String[], word = String[], count = Int64[])
stop_signs = r"[\[.,!?()\]:1234567890]"
readFileToDF!(songLyrics; file=file, folder=folder, remove=stop_signs)
songLyrics
# 1397×3 DataFrame
#   Row │ singer  word     count
#       │ String  String   Int64
# ──────┼────────────────────────
#     1 │ adele   i          921
#     2 │ adele   you        866
#     3 │ adele   the        632
#     4 │ adele   me         484
#   ⋮   │   ⋮        ⋮        ⋮
#  1395 │ adele   adore        1
#  1396 │ adele   ignited      1
#  1397 │ adele   wild         1
#            1390 rows omitted
As we have achieved what we wanted, now we can read the text from all of the files and put everything together in a single data frame.

folder = "Data/chapter04/SongLyrics"
files = readdir(folder);
songLyrics = DataFrame(singer = String[], word = String[], count = Int64[])
for file in files
    println(file)
    readFileToDF!(songLyrics; file=file, folder=folder)
end
songLyrics
# 142501×3 DataFrame
#     Row │ singer      word     count
#         │ String      String   Int64
# ────────┼──────────────────────────────
#       1 │ Kanye_West  the       2007
#       2 │ Kanye_West  i         1867
#       3 │ Kanye_West  you       1411
#       4 │ Kanye_West  a         1024
#    ⋮    │     ⋮          ⋮        ⋮
#  142499 │ rihanna     chick        1
#  142500 │ rihanna     wild         1
#  142501 │ rihanna     kissing      1
#               142494 rows omitted
At this point we may want to save the data frame for future use. The most common way of saving data frames is using the CSV or xlsx format. Let's do both. We will use the CSV package (https://csv.juliadata.org/stable/index.html) for CSV files. Once we have installed it using the usual Pkg.add() or the package manager prompt, we can export the data frame as a CSV file using the write() function in the package.

using CSV  #A
CSV.write("Data/chapter04/songLyrics.csv", songLyrics)

You may refer to the help manager or the package documentation to get more information about the keyword arguments of the CSV.write() function. For example, if you want to use another delimiter instead of the default comma, you can use the delim argument.

CSV.write("Data/chapter04/songLyricsDelim.csv", songLyrics; delim=';')
If you open the songLyricsDelim.csv file in a text editor, you will see that the columns are separated by a semicolon. Alternatively, you may prefer saving the table as an Excel file. There are various Julia packages to import and export Excel files. I have been using the XLSX package and I will go on with that here (https://felipenoris.github.io/XLSX.jl/stable/). We can save the data frame as an xlsx file using the XLSX.writetable() function.

XLSX.writetable("Data/songLyrics.xlsx", songLyrics)
We have read all of the files into one data frame and then saved that data frame as a file. You may also want to create separate data frames for the artists and save them as separate sheets in an Excel file. Let's do this for practice. This time we have to change the flow slightly:

Create an empty Excel file with a name you like.
Read the song lyrics from the files one by one.
Create a separate data frame for each artist.
Add the data frame to the Excel file as a new sheet with the name of the artist.
When all the files have been read, converted to data frames, and saved to the Excel file as separate sheets, close the Excel file.

Listing 4.2 Saving in separate sheets
function saveToSheets(fileToSave)
    isfile("Data/chapter04/$fileToSave") && throw("This file exists!")  #A
    XLSX.openxlsx("Data/chapter04/$fileToSave", mode="w") do xfile      #B
        for file in readdir("Data/chapter04/SongLyrics")
            println(file)
            singer, lyrics = readFile("Data/chapter04/SongLyrics", file)
            lyrics = cleanText(lyrics, stop_signs)
            wd = addWordsToDict(lyrics)
            df = DataFrame(singer = String[], word = String[], count = Int64[])
            addToDF!(df; artist=singer, wordVec=wd)
            XLSX.addsheet!(xfile, singer)
            XLSX.writetable!(xfile[singer], df)
        end
    end
end

saveToSheets("songLyrics2.xlsx")
4.1.2 Delimited Files

We have seen how to read data from text files, create data frames, and save them as CSV or Excel files. Today, CSV and spreadsheet files are still among the most common formats used for saving data. You will most probably need to get data from these types of files in your projects. Even if that is not the case, you will most probably save interim analysis results for future use, as we did above. Otherwise, starting from the beginning every time may cause a significant loss of time and resources. Delimited files are flat files where columns are separated by delimiters. CSV stands for comma-separated values. It is the most common type of delimited file, but not the only one. In Julia, there is also the DelimitedFiles package, which comes with the standard library, but the CSV package is easier to use with data frames, which is why it is usually preferred. Now we can read the CSV file we saved in the previous section using the CSV.read() function. This function has two positional arguments: source and sink, i.e., the file to be read and the function to be used for formatting the result.

songLyrics = CSV.read("Data/chapter04/songLyrics.csv", DataFrame)  #A

And what if we want to read from a delimited file which is not comma delimited? We pass the delimiter with the delim keyword argument.

lyricsDelim = CSV.read("Data/chapter04/songLyricsDelim.csv", DataFrame; delim=';')
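CSV.read also accepts any IO object as the source, which is handy for quick experiments without touching the disk. A minimal sketch; the toy table here is made up:

```julia
using CSV, DataFrames

# a small semicolon-delimited table held in memory
raw = IOBuffer("name;score\nada;90\ngrace;85")
df = CSV.read(raw, DataFrame; delim=';')
size(df)
# (2, 2)
```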
4.1.3 Excel Files

There are various Julia packages that deal with Excel files: XLSX, ExcelFiles, ExcelReaders, XLSXReader, and Taro. XLSX is the most widely used among these. Before going on, please keep in mind that an Excel file is actually a zip archive containing multiple XML files. The format of these files is defined by the ECMA-376 standard (https://www.ecma-international.org/publications-and-standards/standards/ecma-376/). All packages for reading and writing Excel files, in any language, use this standard. Let's first see the long way of reading data from an Excel sheet, for intuition, and then the shorter way. The main function to read an Excel file is readxlsx(). This function reads the whole Excel file at once. When we read an Excel file with this function, it shows the contents (sheets and data ranges) of the file after execution.

using XLSX, DataFrames
xfile = XLSX.readxlsx("Data/chapter04/songLyrics.xlsx")
# XLSXFile("songLyrics.xlsx") containing 1 Worksheet
#             sheetname size          range
# -------------------------------------------------
#                Sheet1 142502x3      A1:C142502
We can access the sheets in the file either by index numbers or by sheet names.

sheet1 = xfile["Sheet1"]
# or xfile[1]
We can get the data in the sheet either with the gettable() or getdata() functions, or by using ranges. The first one returns an XLSX.DataTable object, while the latter two return a matrix.

datatable = XLSX.gettable(sheet1)
datamtx = XLSX.getdata(sheet1)
datarange = sheet1["A1:C142502"]
We can directly convert an XLSX.DataTable object to a data frame.

DataFrame(datatable)
To convert the matrix to a data frame, we need two arguments: the data and the column names. The following lines will produce the same result.

DataFrame(datamtx[2:end, :], datamtx[1, :])
DataFrame(datarange[2:end, :], datarange[1, :])
# 142501×3 DataFrame
#     Row │ singer      word     count
#         │ Any         Any      Any
# ────────┼──────────────────────────────
#       1 │ Kanye_West  the      2007
#       2 │ Kanye_West  i        1867
#       3 │ Kanye_West  you      1411
#    ⋮    │     ⋮          ⋮       ⋮
#  142499 │ rihanna     chick    1
#  142500 │ rihanna     wild     1
#  142501 │ rihanna     kissing  1
#               142495 rows omitted
As you have seen, this way of reading data from Excel files is long and complex. Instead, we can directly get the data from a specified sheet using the readtable() function. This function also returns an XLSX.DataTable object, which can easily be converted to a data frame. We should specify the file and the sheet in the readtable() function. The following line produces the same result as above.

XLSX.readtable("Data/chapter04/songLyrics.xlsx", "Sheet1") |> DataFrame
I have mentioned that the readxlsx() function reads the Excel file at once. There may be times when you have to deal with very large files which cannot be read at once. In this case we can use the XLSX.openxlsx() function for lazy loading. Setting the enable_cache argument to false in the openxlsx() function causes data to be read from disk. That may be what we want when the file is too large to fit in memory. The following code reads the data from the specified sheet line by line and adds each line to a matrix.

datamtx = Matrix{Any}(undef, 0, 3)  #A
fname = "Data/chapter04/songLyrics.xlsx"
XLSX.openxlsx(fname, enable_cache=false) do xfile
    sheet = xfile["Sheet1"]
    for r in XLSX.eachrow(sheet)
        XLSX.row_number(r) % 10_000 == 0 &&        #B
            println("Row: $(XLSX.row_number(r))")  #B
        row = Any[r[i] for i in 1:3]               #C
        row = reshape(row, 1, 3)                   #C
        global datamtx = vcat(datamtx, row)        #D
    end
end

DataFrame(datamtx[2:end, :], datamtx[1, :])
4.2 Summary

You can check the contents of a directory, create a new directory, and create a new file using the readdir(), mkdir(), and touch() functions respectively.
Files can be opened in read, write, and append modes, and in different combinations of these modes.
To install a package, write add PackageName in the package manager. To start using a package in a session, write: using PackageName.
To write and read data in CSV (comma-separated values) format we use the CSV package. To write a CSV file use CSV.write() and to read one use CSV.read().
There are various packages that handle xlsx files. The most common one is currently the XLSX package.
Using the XLSX package, we can list the sheets in an Excel file and read a specific sheet or a given range from a specific sheet.
We can also use lazy loading to read data from very large xlsx files by setting the enable_cache argument to false in the XLSX.openxlsx() function.
5 Data Analysis and Manipulation

This chapter covers

Importing and analyzing a dataset as a data frame
Converting nonnumerical variables to numerical values
Summarizing variables in a data frame
Dealing with missing and outlier values
Normalizing or scaling variables
Analyzing pairwise correlation between variables

In the previous chapter we saw how to read data from different resources and export our data in different formats, either for sharing or for later use. In this chapter, we will go one step further and analyze the data we import. We will see how to get to know our data and gain insights from it. We will also learn how to manipulate our data to prepare it for modeling. All these correspond to the Analysis part in Figure 1.1.
5.1 Project Description One of the most challenging tasks of banks is to assess the creditworthiness of their customers. Banks and other financial institutions develop credit scoring models for this aim. The first step to develop credit scoring models is collecting the data. Besides other reasons, banks collect the historical data for credit applications for later analysis and model development. Credit application data contains two parts: Input data: Historical data for past credit applications. The number of features used in this dataset may be vast. They may include credit bureau data, financial data, past performance data etc. Our input data contains variables about credit limit and exposure, previous overdue payments, check payments, bounced checks etc. as of the credit
application date.
Output data: The performance data of the applicants. An observation period after the application date is set. The duration of the observation period depends mostly on the maturity distribution of the credit portfolio. Output data is generally binary: if there is a default event in this period (i.e. the customer cannot fulfil a credit payment) then the output variable is 1, and 0 otherwise. Our output data contains a list of defaulted customers across all banks. It may contain duplicate records, as one customer may default on credits from different banks.
In this project we will work with the credit application data of a bank. Our main goal is to gain insight into the variables which affect the default probability of a customer. We will analyze, clean and prepare the data for further analysis and try to gain insights from it. I have created the data for this scenario from scratch and tried to include as many real-life use cases as possible. The files we will use in this chapter are ch05_input_data.csv and ch05_output_data.csv.
At a high level, we will:
Import the files of input and output variables as data frames.
Remove duplicates.
Combine input and output data into one data frame.
Convert categorical variables to numerical data to prepare for modeling.
Get the summary statistics of the variables.
Deal with missing data.
Identify and treat outlier values.
Standardize or scale the data if necessary.
Find the correlation between the input variables and the output variable to obtain insights about the relationship between variables.
Some of the steps listed above may seem unfamiliar to you. All of them will be explained in the relevant sections. Now let's go through these steps one by one.
5.2 Import Files We will import the input and output data using the CSV and DataFrames libraries. The files are saved in the Data/chapter05 directory of the code
repository (https://github.com/ilkerarslan/J4DSB/tree/main/Data/chapter05). We have already seen how to import data from CSV files in the previous chapter. using CSV, DataFrames input = CSV.read("Data/chapter05/ch05_input_data.csv", DataFrame)
# 42380×78 DataFrame
#    Row │ custID      appDate     taxNo     custSegment  total ⋯
#        │ Int64       Date        Int64     String15     Int64 ⋯
# ───────┼───────────────────────────────────────────────────────
#      1 │ 9207213177  0002-08-22  19275759  Micro              ⋯
#      2 │ 9403556470  0002-08-22  18351569  Medium
#      3 │ 5103776071  0002-08-22  18251498  Commercial
#    ⋮   │     ⋮           ⋮          ⋮          ⋮            ⋱
#  42379 │ 8642975310  0005-10-22   3086374  Commercial
#  42380 │ 8642975310  0005-10-22   3086374  Commercial         ⋯
#                            74 columns and 42375 rows omitted

output = CSV.read("Data/chapter05/ch05_output_data.csv", DataFrame)
# 921241×5 DataFrame
#     Row │ recordDate  defaultCode  defaultDate  customer_ID ⋯
#         │ Dates.Date  String3      Int64        Int64       ⋯
# ────────┼─────────────────────────────────────────────────────
#       1 │ 0030-04-18  AB           20180426     1797575040  ⋯
#       2 │ 0030-04-18  AB           20180426     408307835
#       3 │ 0030-04-18  AB           20180426     872292050
#     ⋮   │     ⋮           ⋮            ⋮            ⋮       ⋱
#  921240 │ 0025-11-22  XY           20210831     7008745332
#  921241 │ 0025-11-22  XY           20210930     7008745332  ⋯
#                             1 column and 921236 rows omitted
The first thing to do after importing a dataset is to have a first glimpse at the contents of the data frame. We can do that simply by typing the name of the data frame. But if the data frame has many columns or rows, the output is cropped to fit the width and height of the terminal. The three dots in the output indicate that the data frame is cropped, and at the bottom there is a statement giving the number of omitted rows and columns. If you want to see the whole data frame, you can use the show() function with either or both of the allrows=true and allcols=true options. But for large datasets, this may not be feasible. If you want to see it yourself, try the following line of code, but be ready to wait for an unreadable output.
show(input, allrows=true, allcols=true)
Looking at the first few rows of the outputs, we see that the input data has 42 380 rows and 78 columns, while the output data has 921 241 rows and 5 columns. The output data is much larger because it contains all default reports from all banks. The first thing we would like to see may be the column names. We can use the names() function for this, but the output will again be cropped. That's why we pipe the result to the show() function to see all of the column names. The following lines of code will list all the column names. I will not show the output here to save space, but it is always good practice to have a look at the column names of the data.
names(input) |> show
names(output) |> show
Besides column names we also need the data types of the columns. It is always good practice to check the data types: first, to verify that the data was imported in the correct format, and second, because your subsequent code will depend on them. For example, if a column of dates is imported as strings you will not be able to do date operations on that column. If you have a look at the printed data frames, you can see the data types just below the column names. For columns with string data you see types like String15 or String3. These are subtypes of InlineString, and the numbers give the maximum number of characters in that column. The InlineString type is designed for fixed-length strings. Currently, the supported types are String1, String3, String7, String15, String31, String63, String127 and String255.
For small datasets, printing the data frame may be enough to check the data types. For large data we need more. In two lines of code we will first create a dictionary with column names as keys and data types as values. Then we will use the pretty_table() function from the PrettyTables package to display the dictionary without cropping. The first few lines of the output table are presented below the code.
using PrettyTables
coldict = Dict(names(input) .=> eltype.(eachcol(input)));
pretty_table(coldict, crop=:none)
# ┌───────────────────────────┬──────────────────────────┐
# │           Keys            │          Values          │
# │          String           │           Type           │
# ├───────────────────────────┼──────────────────────────┤
# │   totalNoncashLoanLimit   │          Int64           │
# │      maxOverdueDays       │          Int64           │
# │   firstCreditUsageDate    │ Union{Missing, String15} │
# │      maxCheckAmount       │          Int64           │
# │      undueCheckCount      │          Int64           │
# │    exposure_L24M_Lag3M    │          Int64           │
# │     numOverdueAccount     │          Int64           │
If we need to see data type of a specific column instead of all of them, we can use the eltype() function for the selected column. eltype(input[!, "firstCreditUsageDate"]) # Union{Missing, String15}
When selecting rows or columns in a data frame, we can use ! or : to specify all rows or columns, but there is a difference between the two. When we use a colon to extract a column, the column is copied to a separate address in memory, so a change in the copy does not affect the data frame; with !, we get the stored column itself without copying.
From the code above, we understand that the data type is String with at most 15 characters and that, in addition, there are missing values. But the column name suggests that the data type should be Date. We assume that all columns with “Date” in their name should be of type Date. The following code, a slight modification of the code above, gets the current data types of the date columns.
datecols = names(input, r"Date");   #A select columns whose names contain "Date"
Dict(datecols .=> eltype.(eachcol(input[!, datecols])))
# Dict{String, Type} with 10 entries:
#   "firstCheckSightDate"  => Union{Missing, Date}
#   "appDate"              => Date
#   "oldestDefaultDate"    => Union{Missing, Date}
#   "lastCreditUsageDate"  => Union{Missing, String15}
#   "firstCreditUsageDate" => Union{Missing, String15}
#   "latestDefaultDate"    => Union{Missing, Date}
#   "latestCreditLineDate" => Union{Missing, String15}
#   "lastCheckSightDate"   => Union{Missing, Date}
#   "firstBounceDate"      => Union{Missing, Date}
#   "lastBounceDate"       => Union{Missing, Date}
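The difference between : and ! can be demonstrated on a toy data frame (an illustration, not part of the credit dataset):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

col_copy = df[:, :x]   # : copies the column to new memory
col_copy[1] = 99       # mutate the copy
df.x[1]                # still 1; the copy is independent of df

col_view = df[!, :x]   # ! returns the stored column itself, no copy
col_view[2] = 99       # mutate the actual column
df.x[2]                # now 99; the data frame changed
```

Use : when you want a safe, independent copy, and ! when you want to avoid the copy or modify the data frame in place.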
Three of the 10 columns which have “Date” in their name are not of type Date. Our next step is to convert them to type Date. Luckily, we don't have to deal with each date column one by one. Having a look at the data (or consulting the business owners) we can see that the date columns have the format “dd-mm-yy”. We can provide this to the CSV.read() function via the dateformat argument, and all columns with this format will be imported as dates.
Importance of Domain Knowledge
We have checked for columns that should be of type Date by going by the column names. In practice, things may not be that easy. When dealing with a real-life project, I suggest meeting with the owners of the data (business line, IT etc.) and going through the features in the dataset together. The questions you should answer include the data types of the columns, their valid values, how they are calculated etc. Your aim should be to get a full grasp of the data.
using Dates
dfmt = dateformat"dd-mm-yy";
input = CSV.read("Data/chapter05/ch05_input_data.csv",
                 dateformat=dfmt,
                 DataFrame);
Dict(datecols .=> eltype.(eachcol(input[!, datecols])))
# Dict{String, Type} with 10 entries:
#   "firstCheckSightDate"  => Union{Missing, Date}
#   "appDate"              => Date
#   "oldestDefaultDate"    => Union{Missing, Date}
#   "lastCreditUsageDate"  => Union{Missing, Date}
#   "firstCreditUsageDate" => Union{Missing, Date}
#   "latestDefaultDate"    => Union{Missing, Date}
#   "latestCreditLineDate" => Union{Missing, Date}
#   "lastCheckSightDate"   => Union{Missing, Date}
#   "firstBounceDate"      => Union{Missing, Date}
#   "lastBounceDate"       => Union{Missing, Date}
It seems all date columns are imported as dates, but let's also check it ourselves.
input[!, datecols]
# 42380×10 DataFrame
#    Row │ appDate     oldestDefaultDate  latestCreditLineDate  latestDef ⋯
#        │ Date        Date?              Date?                 Date?     ⋯
# ───────┼─────────────────────────────────────────────────────────────────
#      1 │ 0022-08-02  missing            0021-10-09            missing   ⋯
#      2 │ 0022-08-02  missing            0022-01-06            missing
#      3 │ 0022-08-02  missing            0021-12-10            missing
#    ⋮   │     ⋮              ⋮                    ⋮                ⋮      ⋱
#  42378 │ 0022-09-28  missing            0014-12-28            missing
#  42379 │ 0022-10-05  missing            0014-12-28            missing   ⋯
#  42380 │ 0022-10-05  missing            0014-12-28            missing
#                                       7 columns and 42374 rows omitted
It seems that there is something wrong. In the input file the years were stored with the last two digits, and that caused incorrect years in the data. For example, the date “02-08-22” in the original file is parsed as 0022-08-02, which is the year 22, not 2022. We can easily fix this by adding 2000 years to all of the dates.
for col in datecols
    input[!, col] = input[!, col] .+ Year(2000)
end

input[!, datecols]
# 42380×10 DataFrame
#    Row │ appDate     oldestDefaultDate  latestCreditLineDate  latestD ⋯
#        │ Date        Date?              Date?                 Date?   ⋯
# ───────┼───────────────────────────────────────────────────────────────
#      1 │ 2022-08-02  missing            2021-10-09            missing ⋯
#      2 │ 2022-08-02  missing            2022-01-06            missing
#    ⋮   │     ⋮              ⋮                    ⋮               ⋮    ⋱
#  42379 │ 2022-10-05  missing            2014-12-28            missing ⋯
#  42380 │ 2022-10-05  missing            2014-12-28            missing
#                                      7 columns and 42376 rows omitted
Now we have to do the same for the output data. Looking closer, we can see that there are two columns which should have the Date data type: recordDate and defaultDate. But their formats are different. That's why we will read the csv using one format and then convert the other column on the data frame. The recordDate column has the same format as before, so we treat it the same way.
output = CSV.read("Data/chapter05/ch05_output_data.csv",
                  dateformat=dfmt,
                  DataFrame)
output[!, "recordDate"] = output[!, "recordDate"] .+ Year(2000)
This fixes the recordDate column, but the defaultDate column is still in Int64 format. We should convert it first to String and then to Date. You will note that we used different methods to access the columns in the previous code and the following one: accessing a column with df[!, "colName"] and with df.colName are the same.
dfmt2 = dateformat"yyyymmdd"          #A the format of integer dates such as 20180426
int2date(x) = Date(string(x), dfmt2)  #B convert the integer to a string, then parse it as a Date
output.defaultDate = [int2date(el) for el in output.defaultDate]; output # 921241×5 DataFrame # Row │ recordDate defaultCode defaultDate customer_ID bankCode # │ Date String3 Date Int64 Int64 # ────────┼───────────────────────────────────────────────────────────── # 1 │ 2018-04-30 AB 2018-04-26 1797575040 301 # 2 │ 2018-04-30 AB 2018-04-26 408307835 301 # ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ # 921240 │ 2022-11-25 XY 2021-08-31 7008745332 443 # 921241 │ 2022-11-25 XY 2021-09-30 7008745332 443 # 921237 rows omitted
5.3 Remove Duplicates
Remember that the input data contains historical application data of customers. In the first two columns we have the custID and appDate columns, which stand for customer ID and application date respectively. If one customer has more than one record for the same application date, the extra records are not additional observations for us. So, we can remove the duplicates based on custID and appDate. The output data contains default information on customers from different banks. A customer may be reported as default by different banks on the same date, and this is duplicate information for us.
After importing input and output data, one thing that should be checked is duplicate records. Removing duplicates is important because duplicates can bias the data and they increase the computation cost. In Julia we can remove duplicates using the unique() function or its mutating version, unique!(). The first argument of the function is the data frame we
want to work on. Optionally, we can provide the column names to check for uniqueness. Now let’s remove duplicates of the input data based on custID and appDate and output data based on defaultDate and customer_ID. As we use the mutating functions, input and output data will change accordingly. unique!(input, [:custID, :appDate]) unique!(output, [:defaultDate, :customer_ID])
You can check for yourselves that the input data now has 39 283 rows and the output data has 548 035 rows. If you don't provide any column names, all columns are considered when looking for duplicate records. Next, we will combine input and output data.
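The effect of the column argument can be seen on a toy data frame (an illustration, not the credit data):

```julia
using DataFrames

df = DataFrame(a = [1, 1, 1], b = [5, 5, 6])

unique(df)        # all columns considered: rows (1, 5) and (1, 6) remain
unique(df, :a)    # only column :a considered: just the first row remains
```

unique() keeps the first occurrence of each distinct combination of the given columns; unique!() does the same but modifies the data frame in place.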
5.4 Combine Input and Output Data
Before combining the input and output data, I want to explain the combination strategy. First of all, our data is credit application data, and the aim of collecting it is to gain insight into the variables which affect the default probability of a customer. Here, default means not being able to pay the credit back. For the default event, we should choose an observation period. Assume a customer applied for credit on a specific date. We choose an observation period, and if the customer defaults within that period we flag the customer as 1 (default) and otherwise as 0. The observation period depends on the credit maturity. If a specific maturity is not available for each credit line, then we can use the average maturity of the credit portfolio. It is also possible to try different time periods (i.e. 6 months, 1 year, 1.5 years etc.) and select the one which gives the highest correlation or model accuracy. Here we will assume the most common period, which is one year.
For our data we will take the application date and check whether that customer defaulted within one year. If so, we will flag that observation with one. Notice that one customer may have applications on different dates. We will accept each one as a separate observation. This is because, even if they belong to the same customer, the variables may have changed over time.
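The one-year window check for a single application reduces to a chained date comparison; a sketch with made-up dates (not values from the dataset):

```julia
using Dates

app_date     = Date(2022, 3, 10)   # hypothetical application date
default_date = Date(2022, 9, 1)    # hypothetical default date

# true when the default falls strictly within one year after the application
app_date < default_date < app_date + Year(1)
```

This is exactly the condition we will broadcast over all of a customer's default dates in the listing below.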
We will follow this algorithm to determine whether a customer defaulted in the given observation period:
Create an empty array of defaults. The values in this array will be either 1 or 0 (default, nondefault).
For each row in the input data:
Get the customer ID and application date.
Filter the output data for the same customer ID and select the default dates.
If there is a default date within the one-year period after the application date, flag this application as 1, otherwise 0.
Append the default value to the array of defaults.
Combine the defaults array and the input data to create the final dataset.
The following code listing applies the above steps in Julia. You will see that the code itself is much shorter than the algorithm.
Listing 5.1 Combining input and output data for default observations
default = Int[]    #A an empty array to hold the 1/0 default flags

for row in eachrow(input)
    id, date = row.custID, row.appDate                          #B customer ID and application date
    defdates = output[output.customer_ID .== id, :defaultDate]  #C default dates of this customer
    result = date .< defdates .< date + Year(1)                 #D defaults within one year of the application
    sum(result) > 0 ? push!(default, 1) : push!(default, 0)     #E append the flag for this application
end
Now, for each application, we have the information whether a default event occurred within one year after the credit application. Let's see, for example, how many defaults (i.e. ones) there are in the default array. It is simply the sum of the array.
sum(default)
# 3409
And what is the ratio of defaults? sum(default) / length(default) # 0.08678054120102843
We can now create a new data frame which contains all of the input data and the defaults together. We could simply add the default array as a new column to the input data, but you may want to keep the input data as a backup in case something goes wrong.
data = deepcopy(input);    #A copy the input data so the original is kept as a backup
data.default = default;    #B add the default flags as a new column
data[!, end-2:end]         #C display the last three columns
# 39283×3 DataFrame # Row │ avgUndueCheckAmount numBanksCheckAccount default # │ Int64 Int64 Int64 # ───────┼──────────────────────────────────────────────────── # 1 │ 346 2 0 # 2 │ 837 9 0 # ⋮ │ ⋮ ⋮ ⋮ # 39282 │ 186 6 0 # 39283 │ 186 6 0 # 39279 rows omitted
Now we have the dataset we can work on. Next, we will check the nonnumerical columns (e.g. Date, String) and try to convert them to numerical data.
5.5 Convert Nonnumerical Data
Tasks like statistical analysis or machine learning modeling are done with numerical data. We can compute descriptive statistics for categorical data, which is not numerical, but to develop machine learning models we need to convert it to numerical data. Assume you classify your customers into separate segments, as in our input data. You may want to know whether a customer's segment affects its default probability. But using values like “Micro”, “Medium”, “Commercial”, “Corporate” directly in a machine learning model is not possible. We should find a way to represent these values as numbers.
Dates may also be important for analysis, depending on the context. For example, we may not need to analyze the distribution of application dates. But using the first credit usage date, we can calculate the time period of the
company in the banking system.
Please note that not all numerical data is useful for analysis. For example, we use the customer number just to identify different customers, nothing more. There is no point in plotting the histogram or calculating summary statistics of customer numbers.
Our first task is to get the nonnumerical columns. If we had a small data frame we could do it by eye, but our data has 79 columns, so it is better to write code for that. We have seen how to get the data types of columns before. Now we will create a data frame with two columns: the column names and their types.
coltypes = DataFrame(column = names(data),
                     type = eltype.(eachcol(data)))
# 79×2 DataFrame
#  Row │ column                type
#      │ String                Type
# ─────┼────────────────────────────────────
#    1 │ custID                Int64
#    2 │ appDate               Date
#  ⋮   │          ⋮              ⋮
#   78 │ numBanksCheckAccount  Int64
#   79 │ default               Int64
#                      75 rows omitted
Now we can filter the columns with nonnumerical data, but before that there is an important point you should keep in mind. In Julia, missing values have their own type: Missing. If a numerical column has no missing values, it will have a type like Int64, Float64 etc. But assume an Int64 column has missing values. Then what is the data type of the column: Missing or Int64? For such cases Julia has the Union type, and the data type of the column is Union{Missing, Int64}. After examining the coltypes data frame you will see that most of the columns have the type Int64 or Union{Missing, Int64}. The others are Date or String, or unions of those with Missing.
Now, let's select the Date columns and the String columns separately. To show the filtering operation clearly, I have separated the condition we want to filter on from the filtering itself in the following code. Notice the use of the broadcasting dot operator. The Ref() function is used to treat the referenced
values as a scalar during broadcasting. Let’s start with Date columns. They can be either Date or Union{Missing, Date}. datecond = coltypes.type .∈ Ref([Date, Union{Missing, Date}]) datecols = coltypes[datecond, :] # 10×2 DataFrame # Row │ column type # │ String Type # ─────┼──────────────────────────────────────────── # 1 │ appDate Date # 2 │ oldestDefaultDate Union{Missing, Date} # 3 │ latestCreditLineDate Union{Missing, Date} # 4 │ latestDefaultDate Union{Missing, Date} # 5 │ firstCreditUsageDate Union{Missing, Date} # 6 │ lastCreditUsageDate Union{Missing, Date} # 7 │ firstCheckSightDate Union{Missing, Date} # 8 │ lastCheckSightDate Union{Missing, Date} # 9 │ firstBounceDate Union{Missing, Date} # 10 │ lastBounceDate Union{Missing, Date}
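The role of Ref() can be seen in isolation on a toy example (unrelated to the credit data):

```julia
# Ref() wraps the collection so broadcasting treats it as a single scalar:
# each element on the left is tested for membership in the whole collection.
[1, 2, 3] .∈ Ref([2, 3, 4])   # element-wise membership: [false, true, true]
```

Without Ref(), broadcasting would try to pair the elements of the two arrays one by one instead of testing each left-hand element against the whole collection.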
Now, let’s get the String type columns. There is one difference here. As we have seen before, String type columns are actually fixed-width strings like String3, String15 etc. These are all subtypes of InlineString. We can look for subtypes of InlineString in our filtering operation but it is better to use AbstractString which is the supertype of all string types. To test whether type1 is a subtype of type to we write type1