As I mentioned in my last column, the pandemic has shaken up the technology workforce at most employers, both those in IT departments and the many tech professionals who work in lines of business, corporate departments, and product groups, among others. While some industries are struggling to adapt their products and services to a world ravaged by a particularly virulent virus, others are taking advantage of it.
This is also true about technology innovation. Some, like advanced data analytics and the internet of things (IoT)—or to be more accurate, the internet of everything, or IoE—continue to build momentum despite the current economic and social upheaval.
This column will address the job and career opportunities for those aspiring to work or build a career in advanced data analytics (aka big data). First, we'll look at drivers, then hot jobs, and finish with skills and certifications that are displaying strong numbers in our quantitative and qualitative tech labor benchmark research, and that we can recommend as winners on this basis.
First, here's why you should care
By some estimates, there will be 31 billion connected things by the end of 2020 and as many as 75 billion by 2025, according to data collected prior to the COVID pandemic. Moreover, global spending on the IoT/IoE is estimated to reach $1.29 trillion this year, with the global IoT device market reaching $1.1 trillion by 2026. And then there's this eye-opening finding: in 2018 an article in Forbes magazine claimed that more than 90 percent of internet data was generated over the prior two years, and this is expected to grow by more than 5 times by 2025. That's based on 5 quintillion bytes daily according to Cisco.
The effect of this explosion in sensors and devices on big data analytics should be obvious: it's grown from data 'lakes' to 'oceans' and now to data 'galaxies'. Perhaps not as obvious are these drivers and some of the skills in demand for each:
- Proliferation of deep learning frameworks like TensorFlow, Caffe, Keras, PyTorch, and MXNet as companies accelerate monetization of vast data sets. Skills: Neural network algorithms, ASICs/TPUs/FPGAs.
- Merging of Artificial Intelligence and IoT (AIoT) to form a smart, seamlessly connected network of devices over powerful 5G networks promises to transform how we interact with our homes, offices, and cities very soon. Major AIoT segments include wearables, smart home, smart city, and smart industry.
- Shift from ad hoc analytics use cases to operationalizing production quality big data pipelines.
- Rise of real-time streaming analytics, with hot skill areas including: NewSQL databases; in-memory data grids; dedicated streaming analytic platforms converging to enable ultra-fast processing of streaming analytics; open source streaming frameworks like Kafka, Spark, and Flink enabling SQL capabilities.
- Merging of BI/Analytics, data science, and data engineering teams and skill sets.
- Rising demand for workers experienced with supervised algorithms and unsupervised learning, effective in identifying anomalous behavior and triggering reduced or restricted access.
- Rise in cloud-based and containerized identity and access management services (13% CAGR next six years, to $24 billion).
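The anomaly-detection driver above can be illustrated with a very small sketch. It assumes a single account with a baseline of hourly login counts and flags new counts that sit far from that baseline using a z-score. This is a simple statistical stand-in for the unsupervised techniques mentioned, not a production algorithm; the data, the threshold, and the names `flag_anomalies` and `access_level` are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(baseline, observed, threshold=3.0):
    """Return observed values whose z-score against the baseline exceeds threshold."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A flat baseline: anything different from it is unusual.
        return [x for x in observed if x != mu]
    return [x for x in observed if abs(x - mu) / sigma > threshold]


# Hypothetical hourly login counts for one account (illustrative data only).
baseline_logins = [4, 5, 6, 5, 4, 6, 5, 5]
new_logins = [5, 6, 42]

suspicious = flag_anomalies(baseline_logins, new_logins)

# Assumed policy: any flagged value triggers restricted access.
access_level = "restricted" if suspicious else "normal"
print(suspicious, access_level)
```

In practice the baseline would be learned per user or per device and refreshed continuously, which is where the demand for machine learning skills comes in.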
Hot jobs to get you there
There are many hot jobs to be found in the Big Data/IoT space. Here are a few, and why.
Big Data Development Engineer
There are many applications for big data across many industries, and the demand for skilled big data development engineers is growing. Because of the massive amounts of data involved, it has become more and more difficult to manipulate and analyze data and ultimately extract valuable information from it. No matter the level of technical depth or development, demand for this position will continue to increase in the future. There are several hard skills required: SQL, programming, exploratory analysis skills, Hadoop/parallel processing, machine learning, and data mining. As for the soft skills, the ability to model, optimize, and simulate has gained much popularity recently. Big data development engineers should be willing to constantly upgrade skills and accumulate practical experience.
Database Administrator
Many projects need database support, and this position is engaged in the management, maintenance, and security of database systems. Among other duties, database administrators install, back up, update, and patch databases, as well as ensure database access, completeness, and coherence. It is a critical role because the loss of sensitive information could be catastrophic for companies and organizations. Key skills needed for the job include fluency in SQL, UNIX, and databases such as Oracle Database, MySQL, and PostgreSQL, and possibly a relevant certification.
Data Analyst
The responsibilities of this position include developing frameworks for data, analytics, and strategy development; implementing data-analysis tools; collecting and analyzing data sets from diverse sources to inform business decisions and make accurate predictions; tracking and monitoring internal and external data; and providing user training. The best data analysts right now are using machine learning and predictive models to find new ways to analyze data. Key skills needed for the job include SQL querying; database construction; strong statistical abilities; R or Python; an ability to analyze large data sets and filter relevant data sets; an analytical mind with problem-solving ability; experience in data modeling and reporting software; attention to detail; and the ability to write actionable reports in clear language.
Data Engineer
Incumbents in this position build systems to handle big data; design, develop, build, test, and maintain architectures, including databases and large-scale data-processing systems; find ways to acquire and filter data; develop high-performance algorithms for data use, such as predictive modeling and proofs of concept; and create and implement disaster-recovery plans. Key skills needed for the job include knowledge of Hadoop-based technologies, SQL-based technologies, NoSQL technologies, data-modeling tools, and various coding languages including Python, C/C++, Java, and Perl; statistical analysis and modeling; predictive modeling; and natural language processing, machine learning, and text analysis experience.
Data Scientist
The responsibilities of this position include gathering, cleaning, managing, and exploring a large amount of disparate data in order to make predictions; building data models and algorithms; testing hypotheses; and communicating the results. Data scientists generate evidence-based insights that can be communicated in a visual and storytelling fashion in order to aid the business in decision making. They are highly skilled in modeling complex problems, discovering insights, and identifying opportunities by blending a variety of statistical, mining, and visualization techniques with statistical modeling packages, typically in a large-scale, distributed data system. The best data scientists are inquisitive, creative, adaptable, and tenacious. They have a passion for algorithms, an excellent foundation in probability theory and estimation/classification methods, and a solid understanding of machine learning/data mining concepts. They also need a solid understanding of Big Data platforms, frameworks, and programming models using Hadoop, MapReduce, Hive, and Spark, and fluency in SQL, NoSQL, Pandas, Pig, and the like.
Hot big data/IoT skills for 2020-21
The following non-certified big data- and IoT-related tech skills are among the highest paying in our long-running IT Skills and Certifications Pay Index™ of data received from 3,602 employers in the U.S. and Canada. Many are still rising in market value. These are two factors that should certainly be prioritized by tech professionals looking to boost both their compensation and their attractiveness to potential employers.
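To make the figures in the entries that follow concrete, here is a small worked calculation using the first entry's numbers (an 18 percent premium that rose 12.5 percent). It rests on two assumptions that are not stated in the text: the base salary of $100,000 is hypothetical, and "market value increase" is read as the relative change in the premium percentage itself over the period.

```python
def premium_in_dollars(base_salary, premium_pct):
    """Cash value of a pay premium quoted as a percent of base salary."""
    return base_salary * premium_pct / 100


def implied_prior_premium(current_pct, change_pct):
    """Premium percentage before a relative change of change_pct percent.

    Assumes the market value change is the relative change in the premium.
    """
    return current_pct / (1 + change_pct / 100)


base_salary = 100_000      # hypothetical salary, for illustration only
current_premium = 18.0     # percent of base salary
market_value_change = 12.5 # percent, over the reporting period

cash_value = premium_in_dollars(base_salary, current_premium)
prior_premium = implied_prior_premium(current_premium, market_value_change)
print(cash_value, prior_premium)
```

Under those assumptions, the premium is worth $18,000 on a $100,000 salary and would have stood at 16 percent at the start of the period.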
Amazon Athena
Average pay premium: 18 percent of base salary equivalent
Market value increase: 12.5 percent (in the six months through July 1, 2020)
Data is in such great abundance that the answers companies need from it can sometimes be elusive, and the tools to analyze and process that data are not always easy to use, readily accessible, or even that effective. The problem: data has to reside somewhere, and most companies have to think about how it is stored, who will access it, how to make it secure, and most importantly how to make data access reliable and fast. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It's serverless, so you don't have to manage infrastructure or run database software yourself. Plus, it's fast: SQL queries can be run against large datasets with results expected in a matter of seconds. Athena is integrated out of the box with AWS Glue Data Catalog, allowing users to create a unified metadata repository across various services, crawl data sources to discover schemas, populate the catalog with new and modified table and partition definitions, and maintain schema versioning.
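As a rough sketch of what "standard SQL over S3" looks like in practice, the snippet below builds the parameters for Athena's StartQueryExecution call as exposed by the boto3 SDK. The table `sensor_events`, the database `iot_demo`, and the results bucket are hypothetical names chosen for illustration, and the helper functions are not part of any AWS library.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit_query(athena_client, request):
    """Submit the query and return its execution id.

    athena_client is expected to behave like boto3.client("athena").
    """
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical table and columns; Athena reads the underlying files from S3.
SQL = """
SELECT device_type, COUNT(*) AS readings
FROM sensor_events
WHERE event_date = DATE '2020-07-01'
GROUP BY device_type
ORDER BY readings DESC
LIMIT 10
"""

request = build_athena_request(
    SQL.strip(),
    database="iot_demo",                                   # hypothetical
    output_location="s3://example-bucket/athena-results/",  # hypothetical
)
```

With AWS credentials configured, `submit_query(boto3.client("athena"), request)` would start the query; results are written to the output location and can be polled for by execution id.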
RStudio
Average pay premium: 17 percent of base salary equivalent
Market value increase: 21.4 percent (in the six months through July 1, 2020)
RStudio is an integrated development environment for R, a programming language for statistical computing and graphics, and for Python. It is available in two formats: RStudio Desktop and web browser-accessible RStudio Server running on a remote server. RStudio is partly written in the C++ programming language and uses the Qt framework for its graphical user interface; however, a larger share of the code is written in Java and JavaScript. The keys to RStudio's popularity for analyzing data in R include:
- R is open source. It's free, which is an advantage over paying for MATLAB or SAS licenses. This is also important if you're working with global teams in areas where software is expensive or inaccessible. It also means that R is actively developed by a community and there are regular updates.
- R is widely used. R is used in many subject areas (not just bioinformatics), making it more likely that you will find help online when it's needed.
- R is powerful. R runs on multiple platforms (Windows/macOS/Linux). It can work with much larger datasets than popular spreadsheet programs like Microsoft Excel, and because of its scripting capabilities its analyses are more reproducible. There are thousands of available software packages for science, including genomics and other areas of life science.
Master data management
Average pay premium: 17 percent of base salary equivalent
Market value increase: 6.3 percent (in the six months through July 1, 2020)
Master data management (MDM) arose out of the necessity for businesses to improve the consistency and quality of their key data assets, such as product data, asset data, customer data, and location data. Many businesses today, especially global enterprises, have hundreds of separate applications and systems where data that crosses organizational departments or divisions can easily become fragmented, duplicated, and most commonly out of date. When this occurs, accurately answering even the most basic but critical questions about any type of performance metric or KPI for a business becomes hard. The basic need for accurate, timely information is acute, and as sources of data increase, managing it consistently and keeping data definitions up to date so all parts of a business use the same information is a never-ending challenge. That's what has driven, and will continue to drive, a premium on MDM skills.
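The duplication problem described above can be sketched in a few lines. This example assumes customer rows arrive from three hypothetical systems, matches them on a normalized email address, and keeps the most recently updated row as the "golden" record. Real MDM tools use far richer matching and per-attribute survivorship rules; the field names and the "latest wins" rule here are assumptions for illustration.

```python
from datetime import date


def match_key(record):
    """Normalize the email so the same customer matches across systems."""
    return record["email"].strip().lower()


def build_golden_records(records):
    """Keep, for each match key, the most recently updated record."""
    golden = {}
    for record in records:
        key = match_key(record)
        current = golden.get(key)
        if current is None or record["updated"] > current["updated"]:
            golden[key] = record
    return golden


# Hypothetical customer rows from three separate systems.
records = [
    {"source": "crm", "email": "Ana.Diaz@example.com ",
     "city": "Austin", "updated": date(2019, 3, 1)},
    {"source": "billing", "email": "ana.diaz@example.com",
     "city": "Denver", "updated": date(2020, 6, 15)},
    {"source": "support", "email": "li.wei@example.com",
     "city": "Toronto", "updated": date(2020, 1, 9)},
]

golden = build_golden_records(records)
for key, record in golden.items():
    print(key, record["city"], record["source"])
```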
Cloudera Impala
Average pay premium: 16 percent of base salary equivalent
Market value increase: 14.3 percent (in the six months through July 1, 2020)
Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. The fast response for queries enables interactive exploration and fine-tuning of analytic queries rather than the long batch jobs traditionally associated with SQL-on-Hadoop technologies. It also means that data can be stored, shared, and accessed using various solutions in a way that avoids data silos and minimizes expensive data movement. Impala typically returns results within seconds or a few minutes, rather than the many minutes or hours that are often required for Hive queries to complete. We cannot overstate the value of this to advanced data analytics platforms and to the work of data scientists and analysts engaged in Big Data initiatives, or its impact on demand for skills acquisition going forward.
Apache Cassandra
Average pay premium: 16 percent of base salary equivalent
Market value increase: 6.7 percent (in the six months through July 1, 2020)
Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra combines the distribution design of Amazon Dynamo with the data model of Google's Bigtable. It is a database for applications requiring the highest levels of reliability, scalability, and performance.
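The Dynamo-style distribution design mentioned above is usually explained with a hash ring: each node owns a position on the ring, a key is hashed to a position, and the key is stored on the next node clockwise plus a few of its successors. The sketch below is a simplified illustration of that idea only. It uses MD5 and one token per node for readability, which differs from Cassandra's actual partitioner and replication strategies, and the node and key names are hypothetical.

```python
import bisect
import hashlib


def token(value):
    """Map a string to an integer position on the ring (MD5 for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A minimal hash ring with one token per node and no master node."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self._entries = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, partition_key):
        """Return the nodes holding a key: the first node at or after the
        key's token, then the next nodes clockwise around the ring."""
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = len(self._entries)
        return [self._entries[(start + i) % count][1]
                for i in range(self.replication_factor)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-1138")
print(owners)
```

Because every node can compute the same owner list from the key alone, any node can accept a read or write and forward it, which is what removes the single point of failure.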
[Tie] Data science
Scala
Average pay premium: 16 percent of base salary equivalent
Market value increase: 6.7 percent (in the 12 months through July 1, 2020)
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is closely related to data mining and big data: it uses the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems. Data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals. Today, successful data professionals understand that they must advance past the traditional skills of analyzing large amounts of data, data mining, and programming. In order to uncover useful intelligence for their organizations, data scientists must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns at each phase of the process.
The Scala programming language—short for 'scalable'—makes up for a lot of deficiencies in Java, integrating with Java while optimizing code to work with concurrency. It appeals most to enterprises that have already invested in Java and don't want to have to support anything new in their production environments.
[Tie] Data analytics
Google TensorFlow
Predictive Analytics and Modeling
Average pay premium: 16 percent of base salary equivalent
Market value increase: 6.7 percent (in the six months through July 1, 2020)
Data analytics is the science of analyzing raw data in order to make conclusions about that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption. Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass of information. This information can then be used to optimize processes to increase the overall efficiency of a business or system.
TensorFlow is a popular open-source deep learning library developed at Google, which uses machine learning in all of its products to take advantage of its massive datasets and improve search, translation, image captioning, and recommendations. TensorFlow is also used for machine learning applications such as neural networks. Its flexible architecture allows for the easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. TensorFlow provides stable Python and C APIs, as well as APIs without backwards compatibility guarantees for C++, Go, Java, JavaScript, and Swift. Third-party packages are available for C#, Haskell, Julia, R, Scala, Rust, OCaml, and Crystal.
Python has always been the language of choice for TensorFlow because it is extremely easy to use and has a rich ecosystem for data science, including tools such as NumPy, scikit-learn, and pandas.
Predictive Analytics and Modeling is a process that uses data and statistics to predict outcomes with data models. These models can be used to predict anything from sports outcomes and TV ratings to technological advances and corporate earnings. Predictive modeling is also often referred to as:
- Predictive analytics
- Predictive analysis
- Machine learning
These terms are often used interchangeably. However, predictive analytics most often refers to commercial applications of predictive modeling, while predictive modeling is used more generally or academically. Of the terms, predictive modeling is used more frequently. Machine learning is distinct from predictive modeling: it is the use of statistical techniques to allow a computer to construct predictive models, and it is a branch of artificial intelligence, which refers to intelligence displayed by machines. In practice, though, machine learning and predictive modeling are often treated as the same thing.
Predictive modeling is useful because it can give data-driven insight into a question and allows users to create forecasts. To maintain a competitive advantage, it is critical to have insight into future events and outcomes that challenge key assumptions.
Analytics professionals often use data from the following sources to feed predictive models:
- Transaction data
- CRM data
- Customer service data
- Survey or polling data
- Digital marketing and advertising data
- Economic data
- Demographic data
- Machine-generated data (for example, telemetric data or data from sensors)
- Geographical data
- Web traffic data
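As a minimal sketch of what "using data and statistics to predict outcomes" means, the snippet below fits a straight line to a handful of points by ordinary least squares and uses it to forecast the next value. The numbers are made up for illustration (and deliberately lie on a perfect line so the arithmetic is easy to follow); real models draw on sources like those listed above and use far more data and validation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical digital marketing data: ad spend (in $ thousands) vs. orders.
ad_spend = [1, 2, 3, 4, 5]
orders = [12, 14, 16, 18, 20]

model = fit_line(ad_spend, orders)
forecast = predict(model, 6)
print(model, forecast)
```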
Hot big data/IoT certifications for 2020-21
Using criteria similar to those for the non-certified skills above, which big data-related certifications are paying above-average cash pay premiums? Certifications have been declining in value overall for a few years, but some in this category are still gaining in value, which is noted below. To the extent that employers place value on certifications in hiring, promoting, and retaining the workers who make up their big data labor force, consider the following certifications to be winners.
1. SAS® Certified Advanced Analytics Professional Using SAS®9
Average pay premium: 10 percent of base salary equivalent
Market value increase: no change (in the six months through July 1, 2020)
2. SAS® Certified Data Scientist
Average pay premium: 10 percent of base salary equivalent
Market value decrease: 16.7 percent (in the six months through July 1, 2020)
3. Oracle Certified Expert - MySQL 5.1 Cluster Database Administrator
Average pay premium: 9 percent of base salary equivalent
Market value increase: 16.7 percent (in the six months through July 1, 2020)
4. Teradata 14 Certified Master
Average pay premium: 9 percent of base salary equivalent
Market value increase: no change (in the six months through July 1, 2020)
5. Cloudera Certified Associate Spark and Hadoop Developer
Average pay premium: 9 percent of base salary equivalent
Market value decrease: 10 percent (in the six months through July 1, 2020)
6. Cloudera Certified Associate Data Analyst
Average pay premium: 9 percent of base salary equivalent
Market value decrease: 18.2 percent (in the six months through July 1, 2020)
7. SAS® Certified Data Integration Developer for SAS®9
Average pay premium: 8 percent of base salary equivalent
Market value increase: 14.3 percent (in the six months through July 1, 2020)
8. [Tie] Certified Analytics Professional (CAP)
Certified Data Management Professional (CDMP)
IBM Certified Database Administrator - DB2
IBM Certified Solution Developer - DB2 SQL
MongoDB Certified DBA
Teradata 14 Certified Database Administrator
Teradata 14 Certified Enterprise Architect
Teradata 14 Certified Solutions Developer
Average pay premium: 8 percent of base salary equivalent
Market value increase: no change (in the six months through July 1, 2020)
9. SAS® Certified Big Data Professional Using SAS®9
Average pay premium: 8 percent of base salary equivalent
Market value decrease: 11.1 percent (in the six months through July 1, 2020)
10. EMC Data Science Specialist, Advanced Analytics
Average pay premium: 8 percent of base salary equivalent
Market value decrease: 20 percent (in the six months through July 1, 2020)