MLHub Desktop Survival Guide Graham Williams Togaware [email protected] 2021-07-07
Preface 20210320
“The enjoyment of one’s tools is an essential ingredient of successful work.” Donald E. Knuth
Artificial Intelligence (commonly we just say AI) emerged in the middle decades of the 20th century. Significant advances have been made since that time in the ability of machines to learn. Today we see the emergence of apparently intelligent computer programs through the combination of massive computer power analysing massive amounts of data. One of the more popular jobs today is that of the data scientist, applying skills in statistics, artificial intelligence, machine learning, and data analysis to gain insights from data.
AI (including knowledge representation and reasoning), Machine Learning algorithms, and Data Science skills have delivered new insights and understanding of our world. Whilst many of us see this technology as beyond us, it should not be. Yes, we are delivering sophisticated computer software that seems to behave intelligently, but we need to work to understand the technology, not be driven by some mysterious wizardry. AI has made its way into all of our hands, and it is incumbent upon us to understand it and for it to be able to explain itself.
The Machine Learning Hub (MLHub) is a framework and a repository, providing easy access and insights into AI, Machine Learning, and Data Science. It supports freely and openly sharing our technology and experiences, to allow more than the geeks to explore new ideas using this technology. As a repository of packages that capture pre-built demonstrations and models, the hub aims to ensure each package is demonstrable within 5 minutes.
The aim of this book is to help you quickly get started with the MLHub, and to share in the excitement through a simple and productive environment for exploring the state-of-the-art. The MLHub hides the complexity to make the technology accessible. The MLHub repository houses a growing number of curated packages. Each package demonstrates a different technology, quickly. If it looks useful then you can explore and utilise the technology through the package. If not, then move on, having spent only a few minutes finding out.
After the introductions in the first few chapters, the main body of the book is a practical hands-on look at the different AI, Machine Learning, and Data Science packages available from the MLHub. The breadth of available packages is comprehensive, and the depth ranges from simple introductory technology to the current state-of-the-art algorithms. The focus is on making it easy for you to use the technology. For a more detailed exploration of AI, Machine Learning, and Data Science see the Data Science Desktop Survival Guide.
Your donation will support ongoing availability and give you access to the PDF version of this book. Desktop Survival Guides include Data Science, GNU/Linux, and MLHub. Books available on Amazon include Data Mining with Rattle and Essentials of Data Science. Popular open source software includes rattle, wajig, and mlhub. Hosted by Togaware, a pioneer of free and open source software since 1984. Copyright © 1995-2021 [email protected] Creative Commons Attribution-ShareAlike 4.0.
About this Book 20210223
This book is under regular maintenance, and as a work in progress there will be glitches. Please email [email protected] with feedback, corrections, comments and suggestions.
Since I began the survival guide books in 1995 they have grown in many unexpected directions. My original aim was to capture useful notes for the many and varied common tasks I found myself doing, utilising the tools and packages of GNU/Linux for AI and Machine Learning. I structured the book as one page nuggets of information: each section within a chapter was kept to no more than a single printed page, providing a focus on a single task for each page. The concept of the OnePageR Desktop Survival Guide has worked well over the years, judging from my personal use and extensive reader feedback.
Three Desktop Survival Guides currently exist: the GNU/Linux Desktop Survival Guide, the Data Science Desktop Survival Guide, and the MLHub Desktop Survival Guide. The material from these freely available books has also led to two published books: Data Mining with Rattle and Essentials of Data Science.
A PDF version of this book is available for a small donation which goes towards supporting the development and availability of the book. Please visit MLHub for details. The HTML version contains the same material and remains freely available from MLHub.
Technology
This book is produced using bookdown. Emacs is used to edit the text. Many will be using RStudio to edit their bookdown documents, which is a generally more friendly environment and is the environment of choice for bookdown support. I’ve used Emacs since 1985 and as a fully extensible “kitchen-sink” type of editor, it has served me well for over 35 years, despite numerous flirtations with “better” editors over my career. RStudio and Visual Studio Code come close to supporting all that is required, but the flexibility provided by Emacs still makes it the leading and most mature integrated development environment (IDE).
Bookdown is an rmarkdown based platform for intermixing text with executable code (like Python, R and Shell code blocks). Rmarkdown itself utilises the simple markdown syntax to mark up the sections of a document. After running knitr over the rmarkdown material a markdown document is produced, including the output of any commands that were run. Pandoc is utilised to produce HTML from the markdown document. This can be published to the world wide web. Pandoc can also produce PDF output utilising LaTeX, converting the markdown into LaTeX markup, with xetex then used to convert that to PDF. All these tools are open source software, available on multiple platforms, and all for free.
Many books are today being written using bookdown. Examples include Data Science at the Command Line (GitHub) and Efficient R Programming (GitHub).
The MLHub repository itself is implemented using the popular and easy to learn Python programming language on the free and open source Ubuntu distribution of the GNU/Linux operating system. Packages are implemented in Python or R. Whilst not necessary for using the MLHub, you too can learn Python or R through many of the introductory resources available on the Internet, including the Data Science Desktop Survival Guide.
GNU/Linux is the target platform for the MLHub, which is also available for MacOS.
GNU/Linux is the most widely deployed operating system today. It is available, for example, from the Microsoft Store under the Windows Subsystem for Linux. It is also a most productive environment for learning about, utilising, developing and deploying AI, Machine Learning, and Data Science. It is a free and open source operating system that has been continually improved by thousands of developers for over 30 years. See the GNU/Linux Desktop Survival Guide for a guide to deploying Ubuntu on your computer and to delve much more into using GNU/Linux yourself.
Terminology
GNU/Linux refers to the GNU environment and the GNU and other applications running in that environment on top of the Linux operating system kernel.
MacOS refers to the operating system on Apple’s Macintosh computers. The system is based on Unix and can run many of the MLHub packages.
Ubuntu and its underlying base distribution Debian are complete repository based distributions which include many applications pre-built for the particular choice of operating system kernel. The repositories house pre-built packages ready to be installed.
X Window System is the common windowing system used in Ubuntu and is a separate complementary component to the operating system itself.
Microsoft Windows (or MS/Windows and less informatively just Windows) usually refers to the whole of the popular operating system, from kernel to applications, irrespective of which version of Microsoft Windows is being run, unless the version is important. Microsoft Windows is one of many windowing systems and came on to the screen rather later than the pioneering Apple Macintosh windowing system and the Unix windowing systems. We will refer to MS/Windows version 10 as the last release of this Microsoft operating system, which going forward has snapshot releases rather than new versions.
Acknowledgements 20210223
There are many people to thank for sharing these tools, their knowledge, and their encouragement in many different ways. Indeed, the open source communities are characterised by their willingness to share for the good of us all. Many folk have also contributed directly and indirectly to this book through their sharing. Their contributions are acknowledged throughout the book, but there are always gaps. To all who share openly, thank you. I have learned so much from this community over more than 30 years.
Your support for maintenance of this book is always welcome. Financial support is used to contribute toward the costs of running the servers used to make this book available. Donations can be made through the PayPal Donation button at MLHub.
The following have contributed to the content of the book and MLHub with specific material in one way or another. Wee Hyong Tok originally sparked the idea of MLHub with the idea of a repository of pre-built deep neural network models. Simon Zhao and Fang Zhou implemented some of the early prototypes of the system. Anthony Nolan and many others have provided insights and comments that have been incorporated somewhere in the software and/or book. Thanks.
The following are further acknowledged for their support of the book:
Freedom, Utility, and Copyright 20210221
The basis of the ecosystem we are creating is freedom: the freedom to choose, the freedom to learn, the freedom to change, the freedom to share, the freedom to contribute, the freedom to live, the freedom to enjoy, the freedom to communicate through the code and tools we write, and the freedom of, and the right to, privacy. Never let our freedom be taken from us. And through utilising the tools and technology that I make available to you, please give back freely. That may be through a donation to support the ongoing development, or contributions of code, bug fixes, extensions, and packages to MLHub.
Writing of this book began in 1995 in one form or another, and the material continues to be updated as the technology develops. The procedures and applications presented in this book and through the various MLHub packages have been included for their instructional value. They have been tested at various times over the years but are not guaranteed for any particular purpose. We also note that the functionality of different applications can change over time, and whilst we make an effort to update the material the sheer volume presents a challenge. The publisher, togaware.com, does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs and applications.
This work is copyright by Togaware and licensed under the Creative Commons Attribution-ShareAlike 4.0 license. It is made freely available in the hope that it serves as a useful resource for users of Free and Open Source Software, and that you might also consider contributing to our ecosystem of freedoms.
Copyright © Togaware Pty Ltd
1 The Emergence of AI 20200223
Artificial Intelligence (AI) gets variously redefined over time. The origins date from a workshop in 1956 at Dartmouth College in the USA funded by the Rockefeller Foundation. The workshop agreed on the term artificial intelligence and proposed that it encapsulate:
an attempt to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. … For the present purpose the artificial intelligence problem is taken to be that of making a machine behave in ways that would be called intelligent if a human were so behaving.
That captures it nicely. There’s language for communications, representing knowledge as abstractions and concepts, solving problems that are the domain of humans, and learning. These are also at the core of human intelligence.
Data Science gained currency in 2014 to refer to the endeavour of using AI and Machine Learning (and Statistics) to gain new insights and knowledge from data. It aims to translate the fundamental research developments in AI and Machine Learning and make them available and applicable to real problems.
Since the 1950s we have seen four seasons for AI. The 1950s saw reasoning as search with developments in natural language, micro worlds, and neural networks. The 1980s saw a surge in expert systems, knowledge representation, and back propagation. The 1990s saw a focus on agents with Deep Blue, intelligent agents and the emergence of Data Mining. The 2010s saw the emergence of massive data and massive compute, with data science and deep learning made possible by the coupling of accessible massive compute on the cloud and the centralised collection of massive amounts of personal data. The future will see complex knowledge capture and reasoning, widespread access to AI, and massively distributed but privately managed data.
The current era continues to set the foundations for computing machines that will demonstrate AI, though that remains some way off. We see the foundations emerging from the research laboratories in universities and in industry. The MLHub shares the results of these developments, simply, within a framework that makes it easy for you to build models to share and a platform that makes it easy for anyone to explore and utilise the technology.
1.1 Practical Tools for AI
The scope of this book is considerable: it covers the practical capabilities of AI, machine learning and data science as they are today. We throw in a few whimsical tools too, for the fun of it.
2 ML Hub App THIS CHAPTER IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20210625
The MLHub app is under development and is being designed to reduce the barrier to entry
for utilising MLHub.
2.1 MLApp Home Screen THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20210625
The home page of the application provides a scrolling list of the most popular categories
(Rooms) of AI, ML, and Data Science tasks.
A random choice of two packages is listed for each of the rooms, with an indication of the total number of packages in each room. The list only includes those rooms that have a number of packages above a threshold (see the Settings). A full list of Rooms is available through XXXX.
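The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the app's implementation: the function name home_screen, the grouping of packages into rooms, and the reading of "above a threshold" as strictly greater than are all assumptions made for the example.

```python
import random

def home_screen(rooms, threshold, picks=2, rng=None):
    """Return (room, total, sample) rows for rooms holding more than
    `threshold` packages, with `picks` randomly chosen packages each."""
    if rng is None:
        rng = random.Random()
    rows = []
    for room, packages in rooms.items():
        # Assumption: "above a threshold" means strictly greater than.
        if len(packages) > threshold:
            sample = rng.sample(packages, min(picks, len(packages)))
            rows.append((room, len(packages), sample))
    return rows

# Hypothetical grouping of some curated package names into rooms.
rooms = {
    "Classification": ["iris", "pyiris", "rain", "audit"],
    "Computer Vision": ["azcv", "opencv", "objects"],
    "Speech": ["azspeech"],
}

for room, total, sample in home_screen(rooms, threshold=2):
    print(f"{room} ({total} packages): {', '.join(sample)}")
```

With a threshold of 2 the Speech room is left off the home screen, and the other two rooms each show two randomly chosen packages alongside their totals.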
2.2 MLApp Package Room THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20210322
A sample Room containing the selection of MLHub packages that deal with Classification tasks is shown.
Tap any particular package to display that package’s page.
3 MLHub Command Line 20200901
The machine learning hub (MLHub) is a framework and repository, through which the capabilities of Machine Learning, Artificial Intelligence, and Data Science are presented and made accessible. Pre-built Machine Learning and Artificial Intelligence models as well as Data Science best practices are presented as packages. Each package wraps its functionality into commands that can be readily deployed within traditional and powerful Unix/Linux command line pipelines.
MLHub exposes a git software repository as a collection of quickly accessible and ready to run, explore, rebuild, and even deploy, pre-built machine learning models and data science technology. A growing number of machine learning models and data science technologies are becoming available, as well as cloud based services.
Each MLHub package provides a demo command to interactively demonstrate the capabilities of the package. Many packages also include a gui command (graphical user interface) through which to explore the capabilities of the package. A collection of command line oriented commands is then provided by each package to enable the user to explore and utilise the capabilities of AI and Machine Learning algorithms.
Whilst enabling the power of the command line is an important goal of MLHub, the usefulness and relevance to an end user of the capabilities of the package must be demonstrable within about 5 minutes. The user can then decide whether the package supports something useful for them, in which case they can delve more deeply, or move on to something more interesting, having lost only 5 minutes of their time.
The models and technology are accessed and managed using the ml command from the free (as in libre) and open source mlhub software. The software is available for installation through PyPI.
3.1 Installing MLHub on Ubuntu 20200220
MLHub runs on the Ubuntu platform and is implemented in Python3. All of the curated models that are registered with MLHub are tested against Ubuntu LTS (Long Term Support). Ubuntu can be installed on almost anything from a Raspberry Pi to a desktop or laptop running Ubuntu directly or through a virtual machine, or via the Windows Subsystem for Linux (WSL). Linux, on which Ubuntu is built, is the most widely deployed operating system on cloud servers and on smart devices (as Android), and is even the operating system of choice for the helicopter on Mars.
The various options for installing Ubuntu are covered in the GNU/Linux Desktop Survival Guide. Once you have Ubuntu installed, installing MLHub is easy.
For a new Ubuntu server we might first install wajig to simplify using Ubuntu. It is available from the PyPI software repository. Installation of wajig will usually take less than 5 minutes.
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install wajig
$ wajig update
$ wajig upgrade
$ wajig install python3-pip
$ pip3 install wajig
Be sure to log out and log back in after the pip3 install so that the system will notice your local installations. This will refresh the command PATH to ensure that it includes ~/.local/bin. Pip3 installs the ml command there. If all else fails then the following could be useful (but is not usually required):
$ echo 'PATH=~/.local/bin:$PATH' >> ~/.bashrc
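To check whether the PATH already includes ~/.local/bin before editing ~/.bashrc, something like the following Python sketch could be used. The helper name dir_on_path is made up for this illustration and is not part of MLHub or wajig.

```python
import os

def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries of `path_value`."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry  # skip empty entries such as a trailing separator
    ]
    return target in entries

if dir_on_path("~/.local/bin", os.environ.get("PATH", "")):
    print("~/.local/bin is already on the PATH.")
else:
    print("~/.local/bin is not on the PATH: log out and back in, or update ~/.bashrc.")
```

The check only inspects the PATH of the current process, so it reflects what a newly started shell would see only after logging back in.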
Now we are ready to install and configure MLHub also from the PyPI software repository using the pip3 command:
$ pip3 install mlhub
After installation the system can be configured:
$ ml configure
The ml command should now be ready to use. Getting started is now simple. Choose the packages of interest to you from the package catalogue. As a data scientist you may be interested in visualisations (ports), bee swarm charts (beeswarm), and animations (animate). For traditional machine learning there are models for rain prediction (rain) and movie recommendation (movies). For pre-built deep neural network models you can find models to colorize photos (colorize), identify objects (objects), make your computer see with computer vision (azcv), or detect faces (facedetect).
Explore, enjoy, share, and empower. Above all, let’s work toward a collective purpose of ensuring we have a meaningful future for humanity.
3.2 Hello World 20200311
MLHub supports any number of commands that are exposed through the individual model packages. MLHub itself implements the following core commands. Note that everything from the # to the end of the line in the following code block is ignored (it’s a comment). Here <model> stands for the name of a model package.
$ ml                    # Summary of commands supported by ml.
$ ml configure          # Configure the mlhub package itself.
$ ml available          # List of curated models on the MLHub repository.
$ ml installed          # List of models installed locally.
$ ml install <model>    # Install a model.
$ ml configure <model>  # Install the model's required dependencies.
$ ml readme <model>     # View the author's introduction of the model.
$ ml commands <model>   # List commands supported by the model.
$ ml uninstall <model>  # Uninstall the model and (optionally) model cache.
Once MLHub is installed run one of the Hello World examples. A simple one is the rain model from Rattle which demonstrates the use of the decision tree machine learning modeller to predict the likelihood of it raining tomorrow. If it predicts rain, then I’ll take an umbrella with me to work, otherwise no need. The example comes from my Data Mining book. This uses the free and open source R statistical software package which will have been installed when you configured MLHub.
The following sequence of commands illustrates the typical workflow for many MLHub packages:
$ ml install rain    # Install the pre-built model named 'rain'.
$ ml configure rain  # Configure any dependencies for the model.
$ ml readme rain     # View background information about the model.
$ ml commands rain   # List the commands supported by the model.
$ ml demo rain       # Run the demonstration of the pre-built model.
Different packages will have different system dependencies and these will be installed by the configure command. After configuration it is useful to review the packager’s commentary in their readme. The list of commands supported by the package is provided by commands.
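The same workflow can be driven from a script. The following Python sketch simply builds the five ml invocations for a named package and, optionally, runs them in order. The function names are hypothetical, and the sketch assumes the ml command is on the PATH. Some steps (configure and demo, for example) may prompt for input, so this is best suited to interactive use.

```python
import shutil
import subprocess

# The typical sequence of ml subcommands for a package.
WORKFLOW = ("install", "configure", "readme", "commands", "demo")

def workflow_commands(package):
    """Build the ml command lines for the typical package workflow."""
    return [["ml", step, package] for step in WORKFLOW]

def run_workflow(package):
    """Run each workflow step in turn, stopping at the first failure."""
    if shutil.which("ml") is None:
        raise RuntimeError("the ml command was not found on the PATH")
    for command in workflow_commands(package):
        subprocess.run(command, check=True)

# Show the commands without running them.
for command in workflow_commands("rain"):
    print("$", " ".join(command))
```

Calling run_workflow("rain") would then execute the steps one after another, raising an error as soon as one of them exits with a non-zero status.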
Most model packages will support the demo command. The command will demonstrate the capabilities of the package. Some packages also support the gui command which will provide a graphical interface to the package’s functionality.
$ ml gui <model>     # Graphical display to utilise the model.
The remaining commands supported by a package then provide specific functionality usually in a manner suitable for command pipelines (see Section 4). A list of the individual package commands is provided through the commands command.
3.3 ml available 20200221
After installing MLHub (Section 3.1) we are ready to install MLHub packages. The simplest option is to install curated packages. Such packages are reviewed by the MLHub team to ensure their integrity and functionality, and the specific details required to install a package are obtained from an index maintained by MLHub. There is, though, no limit to what models can be packaged for MLHub. MLHub can be pointed to any git repository and will install a package based on the MLHUB.yaml file found there.
The available command lists the available curated packages. These packages can be installed simply through the name of the package (the left hand column).
$ ml available
The repository 'https://mlhub.ai/' provides the following models:
animate      2.1.5   Tell a data narative through animations
audit        4.1.0   Classic financial audit predictive classification model.
azanomaly    3.1.4   Azure Anomaly Detection.
azcv         2.6.0   Azure Computer Vision.
azface       2.1.4   Azure Face API demo.
azlang       0.0.3   Azure language cognitive service on the cloud.
azspeech     4.1.1   Azure Speech cognitive services on the cloud.
aztext       2.4.7   Azure Text Analytics cognitive services on the cloud.
aztranslate  2.4.6   Azure Text Translation cognitive services on the cloud.
barchart     2.0.2   Demonstrate the concept of barcharts.
beeswarm     2.0.1   Demonstrate the concept of bee swarm charts.
cars         0.0.9   Identify car make and model from a photo.
colorize     1.5.8   Demonstrate the concept of photo colorization.
easyocr      0.0.8   Extract text from images.
facedetect   0.2.5   Simple face detection.
facematch    0.4.2   Simple face recognition.
iris         2.1.3   Classic iris plant species classifier.
movies       2.0.3   Movie recommendation using the SAR algorthm.
objects      1.6.26  Recognise objects in an image using resnet152.
opencv       1.0.2   OpenCV Computer Vision.
ports        2.0.0   Demostrate the concept of visualising data.
pyiris       0.0.7   Classification models in Python using the iris dataset.
rain         5.1.3   Predict if it will rain tomorrow (decision tree and rand...
scatter      2.0.1   Demonstrate the concept of scatter plots.
sgnc         0.1.0   Node classification for graphs using StellarGraph.
speech2txt   0.1.1   Convert audio speech to text across multiple services.
To install a named model, local model file or URL:
$ ml install
These are only the curated packages. Any MLHub package can be installed through reference to its GitHub repository. See Section 3.4 for details.
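Since the listing is plain text it is straightforward to process in a script. The following Python sketch assumes each package is printed on its own line as a name, a dotted version number, and a description separated by whitespace. That layout is an assumption based on the listing above, and the names Package and parse_available are made up for this example.

```python
from typing import List, NamedTuple

class Package(NamedTuple):
    name: str
    version: str
    description: str

def parse_available(text: str) -> List[Package]:
    """Extract (name, version, description) rows from a package listing."""
    packages = []
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue
        version = parts[1]
        # Keep only lines whose second field looks like a dotted version.
        if "." in version and version.replace(".", "").isdigit():
            packages.append(Package(*parts))
    return packages

sample = """\
The repository 'https://mlhub.ai/' provides the following models:
iris         2.1.3   Classic iris plant species classifier.
objects      1.6.26  Recognise objects in an image using resnet152.
"""

for package in parse_available(sample):
    print(package.name, package.version)
```

Header and footer lines are skipped because their second field is not a version number.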
3.4 ml install 20210414
The simplest installation identifies a curated package name. The names of these packages are those in the left column of the output of available. For example:
$ ml install rain                # Install the curated package named rain.
The package’s MLHUB.yaml file, containing all information required by MLHub to install the package, is retrieved from a git repository identified through MLHub’s meta-data.
All MLHub packages are hosted within a git repository. We can explicitly identify the GitHub path, thus skipping the curated package list:
$ ml install gjwgit/rain         # Install rain from its GitHub repository.
The default action of the install command is to access MLHUB.yaml from the git repository. The above example will obtain the package’s configuration from https://raw.githubusercontent.com/gjwgit/rain/master/MLHUB.yaml. The repository’s default branch is accessed (master in this case). The ports package identified as gjwgit/ports uses main as the default branch and so the MLHUB.yaml file is retrieved as https://raw.githubusercontent.com/gjwgit/ports/main/MLHUB.yaml.
Specific branches and commits of a git repository can also be identified:
$ ml install gjwgit/rain@dev     # From dev branch.
$ ml install gjwgit/rain@a24e268 # From specific commit.
Specific MLHUB.yaml files within a repository can also be supplied: $ ml install gjwgit/rain:doc/MLHUB.yaml $ ml install https://github.com/gjwgit/rain/testing/MLHUB.yaml
The default git repository is GitHub, but it can be explicitly identified:
$ ml install github:gjwgit/rain
Other supported git servers include GitLab and Bitbucket:
$ ml install gitlab:gjwgit/rain
$ ml install gitlab:gjwgit/rain@2fe89kh:doc/MLHUB.yaml
$ ml install bitbucket:gjwgit/rain
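To make the mapping from an install argument to the location of MLHUB.yaml concrete, here is a minimal sketch in shell. The helper name mlhub_yaml_url is hypothetical and is not part of MLHub. It handles only the GitHub owner/repo[@ref][:path] form. It assumes that a branch or commit given after @ takes the place of the default branch in the raw URL, and that a path given after : takes the place of MLHUB.yaml, following the pattern of the two raw.githubusercontent.com URLs quoted in this section. The default branch is passed in as a second argument because the sketch does not query GitHub.

```shell
# Hypothetical helper, not part of MLHub: print the raw GitHub URL from
# which MLHUB.yaml would be fetched for OWNER/REPO[@REF][:PATH].
# Usage: mlhub_yaml_url SPEC [DEFAULT_BRANCH]
mlhub_yaml_url() {
  spec=$1
  ref=${2:-master}          # assumed default branch when none is given
  file=MLHUB.yaml

  # Split off an optional ":path/to/MLHUB.yaml" suffix.
  case $spec in
    *:*) file=${spec#*:}; spec=${spec%%:*} ;;
  esac

  # Split off an optional "@branch-or-commit" suffix.
  case $spec in
    *@*) ref=${spec#*@}; spec=${spec%%@*} ;;
  esac

  printf 'https://raw.githubusercontent.com/%s/%s/%s\n' "$spec" "$ref" "$file"
}

mlhub_yaml_url gjwgit/rain                      # default branch master
mlhub_yaml_url gjwgit/ports main                # default branch main
mlhub_yaml_url gjwgit/rain@dev:doc/MLHUB.yaml   # branch and file given
```

The first two calls print the two URLs quoted above. The third shows how a branch and a file path would combine under the stated assumptions.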
3.5 ml configure 20210418
The configure command can be used with or without naming a package and accepts the -y or --yes option.
$ ml configure []
-y --yes        Answer yes to any questions.
Without a package name the MLHub package itself is configured.
$ ml configure
This installs a comprehensive collection of AI technology to make your computer AI-ready, including several hundred (mostly small) packages that are downloaded and installed. For each of the major packages you will be asked to confirm that it is okay to install it. This could take up to 5 minutes. Included is the R Statistical Software package. As the instructions will suggest, run the following to immediately turn on tab expansion for mlhub commands and model names. This will be available anyhow on the next login.
$ source /etc/bash_completion.d/ml.bash
This will ensure all of the required dependencies for using MLHub are installed on your computer, including both Python and R and some of the basic common packages for both of them. If a package name is provided then the dependencies for that package are installed.
$ ml configure rain
The -y or --yes option will install all package requirements without asking:
$ ml configure --yes rain
3.6 ml readme 20210503
The readme command shows about a page or screen of details about the package. The full README file is usually available directly from the git repository. For our purposes the readme command shows the top of the package’s README file, up to but not including the Usage section. A package developer will often place a simple introduction and a short quick start guide there, ahead of the usage details that make up the remainder of the README.
3.7 ml demo
3.8 ml gui
3.9 ml commands 20210420
An MLHub package can expose any number of commands. The commands command will list the commands supported by the package. It is expected that for the same functionality different packages will use the same command name. Here is a list of known commands:
$ ml adult       pkg   # Does image contain questionable material.
$ ml analyze     pkg   # Analyze an image.
$ ml brands      pkg
$ ml build       pkg
$ ml category    pkg
$ ml celebrities pkg
$ ml color       pkg   # Colorize a (black and white) photo.
$ ml describe    pkg
$ ml faces       pkg
$ ml geocode     pkg
$ ml identify    pkg   # Identify objects in a photo.
$ ml landmarks   pkg
$ ml language    pkg
$ ml limits      pkg   # Report on any limits to the package.
$ ml links       pkg
$ ml objects     pkg
$ ml ocr         pkg   # Optical character recognition.
$ ml phrases     pkg
$ ml predict     pkg
$ ml sentiment   pkg   # Sentiment of a sentence.
$ ml supported   pkg   # What the package supports. E.g., languages.
$ ml synthesize  pkg   # Synthesize speech from text.
$ ml tags        pkg
$ ml thumbnail   pkg   # Create an effective thumbnail for the image.
$ ml train       pkg   # Train a model based on new data.
$ ml transcribe  pkg   # Transcribe audio from the microphone.
$ ml translate   pkg   # Translate between languages.
$ ml type        pkg
Most commands also support command line options. A single-letter option begins with a single dash and its longer, more explicit form begins with a double dash. Command line options tend to be common across different packages and include:
$ ml command pkg [options] [argument]
-b --bing       Generate Bing Maps URL.
-i --input=     Input data.
-g --google     Generate Google Maps URL.
-h --header     Output a header line for the CSV.
-l --lang=      Target language.
-m --max=       Maximum number of matches.
-o --osm        Generate Open Street Map URL.
-o --output=    Save audio to file.
-t --to=        The code for target language, e.g., fr.
-u --url        Generate Open Street Map URL.
-v --verbose    More information is output.
   --version    MLHub or package version.
-v --voice=     Selected voice.
-y --yes        Answer yes to any questions.
3.10 ml uninstall 20200310
To uninstall a package and thus recover any disk space it might be using we can use the uninstall command. This will also prompt for the removal of the cache maintained for this package, which is often where the larger downloads are stored.
$ ml uninstall rain
Remove '/home/kayon/.mlhub/rain/' [Y/n]?
Remove cache '/home/kayon/.mlhub/.cache/rain/' as well [y/N]? y
3.11 Tips 20200221
For R based models it is often useful to install some R packages through the operating system, or else locally by a user. For the latter case some useful packages to pre-install are identified below. This can be done at any time, but is useful before installing any of the R based MLHub packages. They will not then individually need to install the packages for themselves.
$ R
> install.packages(c("rpart", "tidyverse"))
Similarly for common Python dependencies. One particular example is tensorflow which does not have an Ubuntu package and thus is installed using pip3. This can be installed at any time, and any mlhub package that requires tensorflow will not need to install it separately.
$ pip3 install tensorflow
If a model has installed badly, become corrupted, or is not working as expected, sometimes an uninstall followed by an install will fix the problem. When uninstalling in these circumstances it is usually a good idea to remove the cache as well:
$ ml uninstall objects
Remove '/home/kayon/.mlhub/objects/' [Y/n]? y
Remove cache '/home/kayon/.mlhub/.cache/objects/' as well [y/N]? y
$ ml install objects
Commands Auto Completion
The bash shell on Linux supports command line auto-completion, which is pretty handy. You can download ml.bash from the MLHub and place the file into ~/.local/share/bash-completion/completions/ for recent versions of bash, or into the system-wide location /etc/bash_completion.d/. The configure command installs the file into the system-wide location. Be sure to restart the shell for the auto-completion to take effect.
4 Pipelines 20201024
A general mlhub philosophy is that the output from a command should be in a well defined text format. Typically this will be a csv (comma separated value) format, kept consistent so that follow-on processes within a pipeline can further process the results. These might even be other mlhub models. The mlhub commands focus on their specific task, not solving all problems, but implementing their specific task well. We can then leave extra processing to other specialist tools, like sed, cut, and awk.
This example deploys an optical character recognition capability from the ocr command of the azcv model:
$ ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2
51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0,My cats name is freckles. She like's to cl
37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0,high. She is 2 years old. She likes to
$ ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2 | sed 's/,/\t/'
51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0       My cats name is freckles. She like's to
37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0    high. She is 2 years old. She likes
If you do not care for the bounding boxes that are output by default from the ocr command then simply remove them using cut:
$ ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2 | cut -d, -f2
My cats name is freckles. She like's to climb up
high. She is 2 years old. She likes to play a lot of games.
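The text mentions awk alongside sed and cut. As a minimal sketch, the two OCR output lines from this chapter are replayed from a here-document (so that the example runs without the azcv package) and awk prints the first point of each bounding box followed by the recognised text. It assumes, as the sample output suggests, that the bounding box is a run of space-separated numbers beginning with the top-left x and y. Both function names are made up for illustration.

```shell
# Replay the two sample lines so that no MLHub package is needed.
sample_ocr() {
  cat <<'EOF'
51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0,My cats name is freckles. She like's to cl
37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0,high. She is 2 years old. She likes to
EOF
}

# Hypothetical helper: print "x y<TAB>text" for each OCR line on stdin.
top_left_and_text() {
  awk -F, '{
    split($1, box, " ")                 # box[1] and box[2]: first x and y
    text = substr($0, length($1) + 2)   # everything after the first comma
    printf "%s %s\t%s\n", box[1], box[2], text
  }'
}

sample_ocr | top_left_and_text
```

Taking the text as everything after the first comma, rather than as field 2, keeps any commas that appear within the recognised text.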
We can process every jpg image file in a directory where we may have several hundred files. We will save the text output into a txt file. The following pipeline utilises a for loop, an ml model, and the cut command. The file names are quoted so that names containing spaces are handled:
$ for f in images/*.jpg; do echo "=====> $f"; ml ocr azcv "$f" | cut -d, -f2- > "$(dirname "$f")/$(basename "$f" .jpg).txt"; done
Change the two instances of jpg to png to process png image files, and similarly for pdf files.
Here we transcribe spoken English into text and then translate that text into Persian (Farsi) using azspeech2txt and aztranslate:
$ ml transcribe azspeech2txt friend.wav | ml translate aztranslate --to=fa
en,1.0,fa,... ,این یک آزمایش است تا ببینید که چگونه همه چیز به خوبی ضبط.
A compelling example of a pipeline is to transcribe our English utterances, translate to French and then synthesise into a female French voice using a combination of azspeech and aztranslate. Here it is:
$ ml transcribe azspeech | ml translate aztranslate --to=fr | cut -d',' -f4- | ml synthesize azspeech --voice=fr-FR-HortenseRUS
4.1 Adding Bounding Boxes to a Photo 20210317
Many of the computer vision models will identify bounding boxes for objects within a photo. For example, the faces command of the azcv package returns as the first field the bounding box of any faces found in a photo, one face per line. A relatively simple pipeline can add the bounding boxes for the identified faces to the image. As an example, we first download an image from the Internet, saving it as the file faces.jpg using wget.
$ wget https://bit.ly/38GgwPP -O faces.jpg
The azcv model is then called upon to identify the faces, saving the output, which contains the bounding boxes, to the text file faces_bb.txt.
$ ml faces azcv faces.jpg | tee faces_bb.txt
This text file is piped, using cat, to the cut command to extract the first field (-f1) where fields are separated by a comma (-d,). This field is the bounding box of each face. Using xargs and awk a command is constructed using convert from imagemagick to draw blue rectangles of width 3 pixels for each of the identified faces, saving the resulting image as faces_tmp.png.
$ cat faces_bb.txt | cut -d, -f1 | xargs printf '-draw "rectangle %s,%s %s,%s" ' | awk '{print "faces.jpg -fill none -stroke blue -strokewidth 3 " $0 "faces_tmp.png"}' | xargs convert
If a polygon of 4 points rather than a rectangle is returned by the model, then:
$ ml detect azface 3818.jpg | grep forehead_occluded | cut -d, -f1 | xargs printf "-draw 'polygon %s,%s %s,%s %s,%s %s,%s' " | awk '{print "3818.jpg -fill none -stroke red -strokewidth 5 " $0 "bb.png"}' | xargs convert
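To see what the xargs printf step produces without needing ImageMagick or an MLHub package, this minimal sketch feeds it the first bounding box from the OCR example in Chapter 4. Reading those eight numbers as four x,y points is an assumption about that output. The function name polygon_draw_arg is hypothetical.

```shell
# Hypothetical helper: turn eight numbers on stdin into a convert -draw
# argument describing a four point polygon.
polygon_draw_arg() {
  xargs printf "-draw 'polygon %s,%s %s,%s %s,%s %s,%s' "
}

echo "51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0" | polygon_draw_arg
echo    # finish the line, since the printf format has no newline
```

The awk step then places this fragment between the input file name and the output file name, and the final xargs convert re-parses the quotes so that the polygon description reaches convert as a single argument.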
5 Animations THIS CHAPTER IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20200220
Animations can add considerable insight to any data analysis and can communicate quite effectively the story that the data is telling. The MLHub package, animate, illustrates animations. The sports animation used here is based on example R code posted to Twitter by Victor Yu, in 2018. The data comes from the International Association of Athletics Federations (IAAF). To get started:
$ ml install animate
$ ml configure animate
A demonstration of generating an animation is available through the demo command. The build command can be used to generate a similar animation from user supplied data.
5.1 animate demo 20210318
The demo command illustrates how effective an animation can be by engaging the viewer with insights that a single graphic cannot so readily convey. This can add significantly to the narrative that we are telling through the data and is a fundamental tool for the Data Scientist. Of course, a printed page cannot show the animation (yet).
5.2 animate build TODO
The build command will take a CSV file (e.g., mydata.csv), either a local file or a URL, and generate an animation based on the data in the file. The file needs to have three columns, one named id (such as an athlete’s name), one named event (such as different sporting events), and one named rank. A plot similar to the animation generated for the IAAF data will be produced, saved by default as mydata.gif. Options include:
-O or --output= to name the output file into which the image is to be saved. The filename extension of the specified file will be used to determine the format type;
-t or --type= to specify the image format output as gif (default) or png.
Give it a try on your own data.
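Before running the build command it can help to confirm that a CSV file has the three columns described above. The sketch below is a plain shell check and not an MLHub feature. The function name has_build_columns and the sample rows are invented purely for illustration, and the check assumes an unquoted, comma separated header line.

```shell
# Hypothetical helper: succeed only if the header line of the CSV file
# names the id, event, and rank columns (in any order).
has_build_columns() {
  header=$(head -n 1 "$1" | tr -d '\r')
  for col in id event rank; do
    case ",$header," in
      *",$col,"*) ;;                                   # column present
      *) echo "missing column: $col" >&2; return 1 ;;
    esac
  done
}

# Invented sample data, for illustration only.
tmp=$(mktemp)
printf 'id,event,rank\nathlete_a,100m,1\nathlete_b,100m,2\n' > "$tmp"
has_build_columns "$tmp" && echo "ok: header has id, event, and rank"
rm -f "$tmp"
```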
6 Australian Shipping Ports THIS CHAPTER IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20210324
Package: ports.
How do we get insights into any data that we might collect? Over and over again we discover that when we visualise our data in some way we see things that are simply not so visible when trawling through tables of numbers. Visualising our data is a crucial step in gaining insight into the data. This MLHub package illustrates a variety of plots that bring our data alive.
The examples presented through the ports package come from Chapter 5 of the Essentials of Data Science with the data available from Togaware as an Excel spreadsheet. This is real data and the plots presented here are based on the plots presented in an actual policy report from the Australian Government: Ports: Job generation in a context of regional development, Australian Bureau of Infrastructure, Transport and Regional Economics.
To install, configure, and demonstrate the package:
ml install ports
ml configure ports
ml readme ports
ml commands ports
ml demo ports
6.1 ports demo 20210322
The demonstration aims to highlight the key concept of data visualisation.
ml demo ports
7 iris Plant Species Prediction THIS CHAPTER IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20200220
$ ml install iris
$ ml configure iris
$ ml demo iris
8 Rain Prediction 20210519
Package: rain.
How do we go about representing knowledge in AI? When we build a model using machine learning we need to have a target language that allows us to represent that model. A target language that is quite simple to visualise is the decision tree. Decision trees are covered in quite some detail in the Data Science Desktop Survival Guide.
Decision trees and the ensemble of decision trees within a random forest are both very common approaches to building classification type models in AI. The concept of an ensemble of decision trees was introduced in my 1988 paper Combining Decision Trees: Initial results from the MIL algorithm where the improved performance from multiple trees is demonstrated.
The rattle package in R provides the weatherAUS dataset which is used to predict if it will rain tomorrow (or any other target variable of choice). The package provides a predictive model for the probability of it raining tomorrow based on today’s weather observations. The training dataset consists of daily weather observations from weather stations across Australia capturing the amount of sunshine, the humidity, the amount of rain today, etc.
This simplest of approaches uses the decision tree induction algorithm to build a model that captures knowledge in the form of a decision tree. Other (often more accurate but more complex) models include the random forest which builds a forest (that is, a collection) of decision trees and produces an ensemble model. Ensembles have been shown over many years to produce more accurate models. The example model and code come from my Essentials of Data Science.
We install, configure and demonstrate the model with these commands:
ml install rain
ml configure rain
ml readme rain
ml commands rain
ml demo rain
In addition to the demo command, the package supports the following commands: predict.
8.1 rain demo 20210322
The demonstration introduces the concept of decision tree induction as a machine learning algorithm and knowledge representation language.
$ ml demo rain
Predicting Rain Tomorrow - Decision Tree
AI models can be built from historic data and deployed to provide some degree of accuracy in their prediction. The model here is based on a dataset from a collection of weather observations at different locations over several years. How well it performs is dependent on the training data and the locations at which the model is to be deployed.
8.2 rain demo data 20210418
Prepare the Data
The weatherAUS data comes from the Rattle package (https://rattle.togaware.com). It covers some 50 weather stations in Australia with over 10 years of daily observations of some 20 variables. The data is loaded, cleansed and wrangled, and prepared for modelling, as explained in the OnePageR chapter on data templates: https://onepager.togaware.com/Chapter_Data_Template.html. A view of the data is shared below.
Rows: 176,747
Columns: 24
$ date            2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008…
$ location        "Albury", "Albury", "Albury", "Albury", "Albury", "Al…
$ min_temp        13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.…
$ max_temp        22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9,…
$ rainfall        0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0…
$ evaporation     4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8…
$ sunshine        8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5…
$ wind_gust_dir   w, wnw, wsw, ne, w, wnw, w, w, nnw, w, n, nne, w, sw,…
$ wind_gust_speed 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 4…
$ wind_dir_9am    w, nnw, w, se, ene, w, sw, sse, se, s, sse, ne, nnw, …
$ wind_dir_3pm    wnw, wsw, wsw, e, nw, w, w, w, nw, sse, ese, ene, nnw…
$ wind_speed_9am  20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24, 4…
$ wind_speed_3pm  24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 20,…
$ humidity_9am    71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76, 6…
$ humidity_3pm    22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, 43…
$ pressure_9am    1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 1009.…
$ pressure_3pm    1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 1008.…
$ cloud_9am       8, 5, 5, 5, 7, 5, 1, 5, 5, 5, 5, 8, 8, 5, 5, 0, 8, 8,…
$ cloud_3pm       5, 5, 2, 5, 8, 5, 5, 5, 5, 5, 5, 8, 8, 7, 5, 5, 1, 1,…
$ temp_9am        16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18.3,…
$ temp_3pm        21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30.2,…
$ rain_today      no, no, no, no, no, no, no, no, no, yes, no, yes, yes…
$ risk_mm         0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2.2…
$ rain_tomorrow   no, no, no, no, no, no, no, no, yes, no, yes, yes, ye…
8.3 rain demo fit the model
Fit the Model
Given the historic data which records the outcome we wish to predict (rain_tomorrow) we can fit a model based on that data so as to predict the outcome for new data. The model will be built on a random sample of 70% (123,722) of the observations. This is the training dataset. For the demo command the model is built interactively.
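The training set size follows from the 176,747 rows reported in the view of the data: 70% of 176,747 is 123,722.9, and the figure quoted is the whole-number part. A quick check using shell integer arithmetic, which truncates (whether the demo itself truncates or rounds some other way is an assumption here):

```shell
rows=176747
train=$(( rows * 70 / 100 ))   # integer division drops the fraction
echo "training rows: $train, remaining rows: $(( rows - train ))"
```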
8.4 rain demo decision tree
Display the Decision Tree
An AI model targets a specific knowledge representation language. Here the knowledge is represented as a decision tree. We can gain insight into the model through a textual representation of the decision tree as below. The first line in the description reports the number of observations in the training dataset. The line beginning with ‘node)’ is a legend: split is the test condition, n is the number of observations that have made their way to this node, loss is the error in the prediction at this node, yval is the majority class (i.e., the prediction), and yprob gives the class probabilities.
n= 123722
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 1) root 123722 49488.800 no (0.6000000 0.4000000)
   2) humidity_3pm< 64.5 92796 20592.860 no (0.7514612 0.2485388)
     4) wind_gust_speed< 51 77929 13536.100 no (0.7989322 0.2010678) *
     5) wind_gust_speed>=51 14867 7056.761 no (0.5457407 0.4542593)
      10) humidity_3pm< 45.5 8817 2742.284 no (0.6713901 0.3286099) *
      11) humidity_3pm>=45.5 6050 2875.072 yes (0.3998960 0.6001040) *
   3) humidity_3pm>=64.5 30926 11970.350 yes (0.2929151 0.7070849)
     6) humidity_3pm< 79.5 20058 9709.632 yes (0.4119874 0.5880126)
      12) rainfall< 1.15 12726 6607.019 no (0.5155528 0.4844472)
        24) wind_gust_speed< 47 10144 4398.327 no (0.5749978 0.4250022) *
        25) wind_gust_speed>=47 2582 1080.620 yes (0.3285246 0.6714754) *
      13) rainfall>=1.15 7332 2678.388 yes (0.2697397 0.7302603) *
     7) humidity_3pm>=79.5 10868 2260.721 yes (0.1306888 0.8693112) *
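Reading the printed tree as a set of nested tests makes it easy to trace a single prediction by hand. As a minimal sketch, the function below hard-codes the splits from the tree listing in this section and prints the yval of the leaf that is reached. The name predict_rain is hypothetical, and this is not how the predict command of the rain package is implemented.

```shell
# Hypothetical helper that walks the printed tree by hand.
# Usage: predict_rain HUMIDITY_3PM WIND_GUST_SPEED RAINFALL
predict_rain() {
  awk -v h="$1" -v w="$2" -v r="$3" 'BEGIN {
    if (h < 64.5) {                              # node 2
      if (w < 51)        { print "no"  }         # node 4
      else if (h < 45.5) { print "no"  }         # node 10
      else               { print "yes" }         # node 11
    } else if (h < 79.5) {                       # node 6
      if (r >= 1.15)     { print "yes" }         # node 13
      else if (w < 47)   { print "no"  }         # node 24
      else               { print "yes" }         # node 25
    } else               { print "yes" }         # node 7
  }'
}

# First observation in the view of the data:
# humidity_3pm 22, wind_gust_speed 44, rainfall 0.6.
predict_rain 22 44 0.6
```

For that first observation the path is node 2 and then node 4, a leaf predicting no, which agrees with the first rain_tomorrow value shown in the data.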
8.5 rain demo visual decision tree
Visual Decision Tree
A visual representation of a model can often be more insightful than the printed textual representation. A decision tree model can readily be visualised as a tree structure as we will see. The tree is read from top to bottom, traversing the path corresponding to the answer to the question presented at each node. The leaf node has the final decision together with the class probabilities.
8.6 rain demo variable importance
Variable Importance
One aspect of understanding the data and the models that we build is knowing which variables play the most significant role in predicting the outcome. The variables that actually end up in the model are: humidity_3pm, rainfall, wind_gust_speed. All variables are considered in the modelling and the relative importance of each variable in predicting the outcome is determined. Note that in the table below the actual numbers represent the relative importance of that variable.
Relative Importance of Variables:
humidity_3pm      58
wind_gust_speed    8
rainfall           7
humidity_9am       6
cloud_3pm          6
temp_3pm           5
sunshine           5
wind_speed_3pm     2
wind_speed_9am     1
max_temp           1
pressure_9am       1
8.7 rain demo visual variable importance Visual Importance Once again a visual presentation of the variable importance can be more effective in conveying the relative importance.
8.8 rain demo variable selection Variable Selection When the model was built, the algorithm chose a variable for each node of the resulting decision tree. An entropy, information theory or gini based calculation is used to choose the variable. The variable with the highest value according to this measure is chosen for the particular node. Below we see the calculations that were made for the root node of the tree (Node Number 1). A number of variables were considered and the variable with the top score was chosen for this node. The improve= is the value of the calculation.

Node number 1: 123722 observations,    complexity param=0.342
  predicted class=no   expected loss=0.4  P(node) =1
    class counts: 97753 25969
   probabilities: 0.600 0.400
  left son=2 (92796 obs) right son=3 (30926 obs)
  Primary splits:
      humidity_3pm < 64.5  to the left,  improve=11510, (0 missing)
      rainfall     < 0.35  to the left,  improve= 7486, (0 missing)
      rain_today   splits as  LR,        improve= 7133, (0 missing)
      cloud_3pm    < 6.5   to the left,  improve= 5030, (0 missing)
      humidity_9am < 73.5  to the left,  improve= 4535, (0 missing)
  Surrogate splits:
      cloud_3pm    < 7.5   to the left,  agree=0.778, adj=0.112, (0 split)
      humidity_9am < 87.5  to the left,  agree=0.775, adj=0.100, (0 split)
      sunshine     < 3.25  to the right, agree=0.772, adj=0.088, (0 split)
      temp_3pm     < 12.55 to the right, agree=0.771, adj=0.085, (0 split)
      rainfall     < 4.85  to the left,  agree=0.766, adj=0.063, (0 split)
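As a rough illustration of the kind of calculation behind improve=, the sketch below computes a Gini based improvement for one candidate split. It assumes the improvement is the number of observations at the node multiplied by the drop in Gini impurity from the parent to the weighted children. The class counts are hypothetical, chosen for easy arithmetic, and the function name gini_improve is made up for this example. The figures reported by the demo also depend on settings such as class priors, so this sketch is not expected to reproduce the value 11510.

```shell
#!/bin/sh
# Gini improvement for one candidate split (hypothetical counts).
# Arguments: left_no left_yes right_no right_yes
gini_improve() {
  awk -v ln="$1" -v ly="$2" -v rn="$3" -v ry="$4" '
    # Gini impurity of a node holding a "no" and b "yes" observations.
    function gini(a, b,    n) { n = a + b; return 1 - (a/n)^2 - (b/n)^2 }
    BEGIN {
      nl = ln + ly; nr = rn + ry; n = nl + nr
      parent   = gini(ln + rn, ly + ry)
      children = (nl/n) * gini(ln, ly) + (nr/n) * gini(rn, ry)
      # Assumed scaling: node size times the reduction in impurity.
      printf "%.2f\n", n * (parent - children)
    }'
}

# 20 observations split into a mostly "no" left node and a mostly "yes" right node.
gini_improve 8 2 2 8
```

The candidate split with the largest such value would be the one chosen for the node, which is how humidity_3pm comes to sit at the root of the tree.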
8.9 rain demo predicting rain tomorrow Predicting Rain Tomorrow We now use the model to make predictions. The decision tree model is applied to a random subset of the dataset of daily observations that the model has not previously seen, the tuning dataset. This dataset contains 26,512 observations. This provides an insight into the performance of the model on new/unseen data. The performance here is okay based on this dataset. Note any highlighted errors. No model is perfect.

  Actual Predicted Error
1 no     no
2 no     yes

Recognized: Welcome to a demo of the prebuilt models speech provided
> through Azure as cognitive services. The speech cloud service provides
> speech to text and text to speech capabilities.
Press Enter to continue:
Now type text to be spoken. When Enter is pressed you will hear the result.
> Welcome to a demo of the prebuilt models for speech.
The first paragraph from the screen was read and the Azure Speech to Text service was mostly accurate in its transcription. For synthesis the same text was used and could be heard through the system speakers.
15.4 azspeech synthesize THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER The synthesize command will generate spoken word audio, spoken by a human sounding voice, from supplied text, and will play the audio on the system’s default audio output. With -o or --output a wav file can be specified as the output rather than having the audio played through the speakers.

$ ml synthesize azspeech [sentence]
    -f  --file=     Text to be spoken.
    -l  --lang=     Target language.
    -o  --output=   Save synthesized audio to file.
    -v  --voice=

The simplest usage is to synthesise the sentence provided on the command line:
$ ml synthesize azspeech Welcome my son, welcome to the machine.
The spoken language can be chosen, though this will attempt to pronounce the words as if they are French:
$ ml synthesize azspeech --lang=fr-FR It's alright, we know where you've been.
$ ml synthesize azspeech --voice=en-AU-NatashaNeural You brought a guitar to punish
$ echo "It's alright, we told you what to dream" | ml synthesize azspeech
$ ml synthesize azspeech --file=short.txt
$ ml synthesize azspeech --lang=de-DE --file=short.txt
$ ml synthesize azspeech --voice=fr-FR-DeniseNeural --file=short.txt
The supported languages and their locale codes (BCP-47) are listed at Azure Docs.
15.5 azspeech transcribe 20210609
The transcribe command will, by default, listen for up to 15 seconds of speech from the microphone and then convert it to text, written to the console. The command can also be used to transcribe speech from an audio file (wav). The source language may be required, though several languages are automatically identified.

$ ml transcribe azspeech
    -i  --input=
    -l  --lang=

A simple example, listening for the audio on the microphone:
$ ml transcribe azspeech
The machine learning hub is useful for demonstrating capability of models as well as providing command line tools.
The command can take an audio wav file, specified using the -i or --input options, and transcribe it to the console. For large audio files this can take some time. Currently only wav files are supported through the command line (though the cloud service also supports mp3, ogg, and flac). In the following example a sample wav file is downloaded and then transcribed:
$ wget https://github.com/realpython/python-speech-recognition/raw/master/audio_file
$ ml transcribe azspeech --input=harvard.wav
The stale smell of old beer lingers it takes heat to bring out the odor. A cold dip restore's health and Zest, a salt pickle taste fine with Ham tacos, Al Pastore are my favorite a zestful food is the hot cross bun.
To convert between file formats see the GNU/Linux Desktop Survival Guide. To save the output to a text file simply use the shell redirect operator > .
$ ml transcribe azspeech --input=harvard.wav > harvard.txt
$ cat harvard.txt
The stale smell of old beer lingers it takes heat to bring out the odor. A cold dip restore's health and Zest, a salt pickle taste fine with Ham tacos, Al Pastore are my favorite a zestful food is the hot cross bun.
The input language will affect the AI’s capability and whilst it can automatically identify some languages, it cannot identify them all (at least not yet). We can assist by identifying the source language. In this example it is Indonesian. The first attempt results in a mix of English and some Indonesian.
$ ml transcribe azspeech --input=indonews.wav
Any luck a barbaric abair poker delapan waktu Indonesia parrot, cyano millionaire.
Knowing the language results in greater accuracy:
$ ml transcribe azspeech --lang=id-ID --input=indonews.wav
Inilah Kabar baru kabeer 8:00 waktu Indonesia Barat saya Naomi liandra.
The language code is the BCP-47 locale and supported codes are listed at https://docs.microsoft.com/en-gb/azure/cognitive-services/speech-service/language-support
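Building on the redirect example earlier in this section, a small loop can transcribe a whole directory of wav files. This is only a sketch: the function name transcribe_dir and the TRANSCRIBE override are assumptions introduced here for illustration, and the only option used is the --input option described above. The override lets the loop be tried with a stand-in command before spending any of a cloud quota.

```shell
#!/bin/sh
# Transcribe every wav file in a directory, writing the text beside it
# as a .txt file with the same base name.
# $1: directory containing wav files.
# TRANSCRIBE (hypothetical override): command to run in place of the default.
transcribe_dir() {
  for f in "$1"/*.wav; do
    # Skip the unexpanded pattern when the directory holds no wav files.
    [ -e "$f" ] || continue
    # Left unquoted on purpose so the default splits into command and arguments.
    ${TRANSCRIBE:-ml transcribe azspeech} --input="$f" > "${f%.wav}.txt"
  done
}
```

With the package installed and configured, running transcribe_dir on a directory holding harvard.wav would be expected to leave harvard.txt beside it. Each file is a separate request to the cloud service, so large collections may take some time.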
15.6 azspeech transcribe pipelines 20210607
We can pipe the output from transcribe to other tools so as to, for example, analyse the sentiment of the spoken word. In the first instance you might say happy days and in the second say sad days.
$ ml transcribe azspeech | ml sentiment aztext
0.96
$ ml transcribe azspeech | ml sentiment aztext
0.07
Pipelines can become quite powerful. Indeed, a pipeline can exhibit AI that might appear to be more than just the sum of its parts. Here, it transcribes the audio from the microphone, which for me would be English, translates it to French, cuts out the actual text (the fourth comma separated field onwards), and synthesizes it in a French voice.
$ ml transcribe azspeech | ml translate aztranslate --to=fr | cut -d',' -f4- | ml synthesize azspeech --voice=fr-FR-HortenseRUS
Voila
15.7 azspeech resources THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER MLHub Speech Services Documentation Supported Languages Python code for Speech Recognizer: Speech2Text Python code for Speech Synthesizer: Text2Speech
16 Azure Text Analysis 20210610
Package: aztext.
Do you remember how you learnt to read? AI is learning to read using something called deep learning and analysing thousands of examples of text. After extensive training, often using a lot of computer power (and electricity), we can build a basic AI model. It’s difficult to say that it has any understanding, but it gives that appearance, and irrespective, is a useful tool. The aztext package demonstrates some of the capability of AI based text analysis. It does this using Microsoft’s Azure Text cloud service which makes a number of natural language processing (NLP) models available. See Section 16.1 for details. This MLHub package makes most of the Azure Text Analytics functionality easily available for us to explore the capabilities of NLP and even to utilise it in our own tools. To install, configure, and demonstrate the package:
ml install aztext
ml configure aztext
ml readme aztext
ml commands aztext
ml demo aztext
In addition to the demo command the package also supports analyze, entities, language, links, phrases, sentiment, and supported. Azure-based models, unlike the MLHub models in general, use closed source services which have no guarantee of ongoing availability and do not come with the freedom to modify and share. This cloud based service also sends your text to the Azure cloud for analysis.
16.1 aztext overview 20210610
A free Azure subscription allowing up to 5,000 text records per month (last checked at Azure Pricing 20210701) is available from Microsoft at https://azure.microsoft.com/free/. A text record is 1,000 characters. After subscribing visit the Azure Portal at https://portal.azure.com and Create a resource under AI and Machine Learning called Text Analytics. Once created you can access the web API subscription key and endpoint from the portal. These will be prompted for when you configure the package. The credentials will be saved to file to reduce the need for repeated authentication requests. The first time you will be asked to enter the key and cloud endpoint:
$ ml configure aztext
Private information is required to access this service. See the README for more details.
Please paste your Text Analytics key: *************************
Please paste your endpoint: https://myaztext.cognitiveservices.azure.com/
That information has been saved into the file:
/home/kayon/.mlhub/aztext/private.json
If the file containing the private information already exists, you will be given a chance to display or edit the file, or else to load the private information from the file:
$ ml configure aztext
The following file has been found and is assumed to contain the private information.
/home/kayon/.mlhub/aztext/private.json
Use this private information ('d' to display, 'n' to update) [Y/d/n]?
The default response is Y (i.e., yes) but if you’d like to have a look at the key first then use d (i.e., display). To enter new information select n (i.e., respond no to loading the current private information).
16.2 aztext quick start THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER 20210610
There are quite a few commands supported by this package. Try some of them for yourself.
$ ml supported aztext
$ ml analyze aztext Winter has set in and the days are short and cold.
$ ml analyze aztext 这是一个用中文写的文件
$ ml entities aztext I had a wonderful trip to Seattle
16.3 aztext demo The pre-built demonstration highlights the capabilities of the package. ml demo aztext
Here is a sample of the interaction.
==================== Azure Text Analytics ====================
Welcome to a demo of the pre-built models for Text Analytics provided through Azure's Cognitive Services. This service extracts information from text that we supply to it, providing information such as the language, key phrases, sentiment (0-1 as negative to positive), and entities.
Press Enter to continue:
==================== Language Information ====================
We will first demonstrate the automated identification of language. Below are a few "documents" in different languages which are passed on to the
cloud for processing using the following language API URL:
Press Enter to continue:
1 Text as a sample document written in English. This is English (en) with score of 1.0.
2 Este es un document escrito en Español. This is Spanish (es) with score of 1.0.
...
================== Sentiment Analysis ==================
Now we look at an analysis of the sentiment of the document/text. This
is done so by passing the text of the text on to the sentiment API URL shown below for processing in the cloud. The results are returned as a number between 0 and 1 with 0 being the most negative and 1 being the most positive.
Press Enter to continue:
1 I had a wonderful experience! Rooms were wonderful and staff helpful. This has a sentiment rating of 0.97.
2 I had a terrible time at the hotel. The staff was rude and food awful. This has a sentiment rating of 0.00.
...
======== Entities ========
Our final demonstration identifies the entities refered to in the text. As a bonus the API generates a link to Wikipedia for more information! As
above, the text is passed on to the cloud through the
API at the URL below.
...
16.4 aztext supported 20210612
The supported command identifies what AI functionality is supported for each language. The output consists of a row for each language with comma separated values (csv), and begins with the full language name, followed by the language code, and the support provided for sentiment analysis, identification of phrases, and extraction of entities.

$ ml supported aztext []
    -h  --header    Output a header line for the CSV.

language,code,sentiment,phrases,entities

The supported languages as of May 2021 are:
$ ml supported aztext
Chinese-Simplified,zh-hans,True,True,True
Chinese-Traditional,zh-hant,True,True,False
Dutch,nl,True,True,False
English,en,True,True,False
French,fr,True,True,False
German,de,True,True,False
Hindi,hi,True,True,False
Italian,it,True,True,False
Japanese,ja,True,True,False
Korean,ko,True,True,False
Norwegian (Bokmål),no,True,True,False
Portuguese (Brazil),pt-BR,True,True,False
Portuguese (Portugal),pt-PT,True,True,True
Spanish,es,True,True,False
Turkish,tr,True,True,False
To check if a specific language is supported:
$ ml supported aztext en
English,en,True,True,True
$ ml supported aztext id
Use the --header command line option to list the header row which names the columns:
$ ml supported aztext --header en
language,code,sentiment,phrases,entity
English,en,True,True,True
A pipeline can be constructed to test if a language is supported:
$ lang=id
$ if test $(ml supported aztext ${lang} | wc -l) -ne 0; then echo "${lang} is supported"; else echo "${lang} is not supported"; fi
id is not supported
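Because the output is plain CSV, the usual text tools can answer questions such as which languages support entity extraction. The sketch below works on a few rows copied from the listing of supported languages above so that it runs without an Azure key. The function names supported_sample and entity_codes are made up for this example. In practice the same awk filter could be fed directly from ml supported aztext.

```shell
#!/bin/sh
# A few rows as printed by the supported command (copied from the listing).
supported_sample() {
  cat <<'EOF'
Chinese-Simplified,zh-hans,True,True,True
Dutch,nl,True,True,False
Portuguese (Portugal),pt-PT,True,True,True
Turkish,tr,True,True,False
EOF
}

# Print the code (field 2) of each language whose entities column (field 5) is True.
entity_codes() {
  awk -F, '$5 == "True" { print $2 }'
}

supported_sample | entity_codes
```

Changing the field number in the test to 3 or 4 would list the languages supporting sentiment analysis or key phrases instead.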
16.5 aztext analyze 20210612
The analyze command performs a basic analysis across the four capabilities (language, sentiment, entities, and phrases).

$ ml analyze aztext []

conf,lang,sentiment,phrases,entities

The command takes a single sentence and returns the text analysis of the sentence, beginning with the confidence of the identification of the language, the language code, the sentiment (0 to 1 as negative to positive), the key phrases identified separated by colons, and the identified entities also separated by colons.
$ ml analyze aztext I had a wonderful experience! The rooms were wonderful and staff
1.0,en,0.96,wonderful experience:rooms:staff helpful,
$ ml analyze aztext I had a trip to Singapore and enjoyed seeing the Botanic Gardens
1.0,en,0.92,Singapore:Botanic Gardens:trip,Location=Singapore:Location=Singapore Bot
$ ml analyze aztext Los caminos que llevan hasta Monte Rainier son espectaculares y
1.0,es,0.55,Monte Rainier:caminos,Location=Monte Rainier
$ ml analyze aztext La carretera estaba atascada. Había mucho tráfico el día de ayer
1.0,es,0.33,carretera:tráfico:día,
$ ml analyze aztext 这是一个用中文写的文件
1.0,zh_chs,0.75,,
$ ml analyze aztext Các bãi biển trên Phú Quốc là tuyệt vời.
1.0,vi,,,
Note that sentiments, key phrases, and entities are not supported for all languages. Refer to the supported command. The analyze command will also work without an argument, in which case it reads text from standard input if it is part of a pipeline.
$ cat sample.txt
They’re annoying the hell out of people.
Pour le logiciel libre, la liberté a un prix et un modèle économique
I just toured Ecuador and Peru and came back addicted to plantains.
$ cat sample.txt | ml analyze aztext
1.0,en,0.06,hell:people,
1.0,fr,1.00,liberté:prix:logiciel libre:modèle économique,DateTime=a un:Quantity=et
1.0,en,0.70,Peru:toured Ecuador:plantains,Location=Ecuador:Location=Peru
If the analyze command is not part of a pipeline then it will enter an interactive loop, prompting for a sentence, and analyzing that sentence.
$ ml analyze aztext
Enter text to be analysed. Quit with Empty or Ctrl-d.
(Output: conf,lang,sentiment,phrases,entities):
> La primera vez que escucho semejante palabra
1.0,es,0.52,semejante palabra,Quantity=primera
> I have returned to this country Australia to live in Melbourne
1.0,en,0.50,country Australia:Melbourne,Location=Australia:Location=Melbourne
>
Type Ctrl-D to finish.
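The colon separated phrases and entities fields are easy to unpack with standard tools. As a minimal sketch, the function below (list_entities is a name made up for this example) prints one entity per line from analyze output read on standard input. The sample line is taken from the interactive session above rather than from a live call, and the sketch assumes that none of the five fields contains an embedded comma.

```shell
#!/bin/sh
# Print each TYPE=NAME entity from analyze output, one per line.
# Assumes five comma separated fields with the entities last, joined by colons.
list_entities() {
  awk -F, 'NF >= 5 && $5 != "" {
    n = split($5, ents, ":")
    for (i = 1; i <= n; i++) print ents[i]
  }'
}

echo '1.0,en,0.50,country Australia:Melbourne,Location=Australia:Location=Melbourne' |
  list_entities
```

Lines with an empty entities field, as returned for languages without entity support, produce no output.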
16.6 aztext entities THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER Named Entity Recognition The entities command identifies the entities from the text together with other information, including the type of entity and a Wikipedia link. For each entity identified the output consists of a single line reporting the entity name, the type of entity and sub-type, the confidence of the type of entity, the offset to the entity in the original text, the text length of the entity, the Wikipedia confidence, language, entity name, and URL.
$ ml entities aztext I had a wonderful trip to Seattle last week and even visited th
Seattle,Location,,0.82,26,7,0.24,en,Seattle,https://en.wikipedia.org/wiki/Seattle
last week,DateTime,DateRange,0.80,34,9,,,,
Space Needle,Location,,0.80,65,12,0.39,en,Space Needle,https://en.wikipedia.org/wiki
Space Needle,Organization,,0.94,65,12,,,,
2,Quantity,Number,0.80,78,1,,,,
As part of a command line we could count the number of unique entities in the text:
$ ml entities aztext I had a wonderful trip to Seattle last week and even visited th
cut -d, -f1 | sort -u | wc -l
4
How many unique locations are identified in the text:
$ ml entities aztext I had a wonderful trip to Seattle last week and even visited th
awk -F, '$2=="Location"{print}' | cut -d, -f1 | sort -u | wc -l
2
16.7 aztext phrases THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER The phrases command extracts the key phrases from the supplied text.
16.8 aztext language THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER
16.9 aztext sentiment THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER The sentiment command determines how positive the text is on a scale from 0 (negative) through 0.5 (neutral) to 1 (positive).
$ ml sentiment aztext The weather here is cold and dreary
0.17
$ ml sentiment aztext had a great trip and all went really well
0.97
Pipeline: Extract Just Positive Utterances From File
$ cat sample.txt | ml sentiment aztext | awk '$1>0.5{print NR}' | xargs -I % sed -n %p sample.txt
Pour le logiciel libre, la liberté a un prix et un modèle économique
I just toured Ecuador and Peru and came back addicted to plantains.
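The pipeline above runs sed once for each matching line. A possible alternative, sketched here under the assumption that the sentiment command prints exactly one score per input line, is to paste the scores alongside the original lines and filter them in a single awk pass. To keep the example runnable offline, the scores are hard-coded from the analyze output for sample.txt shown in Section 16.5 rather than fetched from the service, and the function name positive_lines is made up for this example.

```shell
#!/bin/sh
# Keep only the lines whose sentiment score is above 0.5.
# $1: file of scores, one per line; $2: file of text, one utterance per line.
positive_lines() {
  # paste joins the two files line by line with a tab between them.
  paste "$1" "$2" | awk -F'\t' '$1 > 0.5 { print $2 }'
}

tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
They’re annoying the hell out of people.
Pour le logiciel libre, la liberté a un prix et un modèle économique
I just toured Ecuador and Peru and came back addicted to plantains.
EOF
# Scores reported for these three lines earlier in the chapter (not a live call).
printf '0.06\n1.00\n0.70\n' > "$tmp/scores.txt"

positive_lines "$tmp/scores.txt" "$tmp/sample.txt"
rm -rf "$tmp"
```

This prints the same two lines as the pipeline above. It relies on the utterances containing no tab characters, since the tab is used to separate the score from the text.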
16.10 aztext links THIS SECTION IS UNDER DEVELOPMENT. PLEASE COME BACK LATER The links command will return the text as is but with entities that have Wikipedia pages marked up with HTML to link to the appropriate page. This is particularly useful in writing web pages within which you want to have links to Wikipedia. $ ml links aztext