What is a Data Source?
A data source is the location from which the data in use originates.
A data source may be the initial location where data is born or where physical information is first digitized; however, even the most refined data may serve as a source, as long as another process accesses and utilizes it. Concretely, a data source may be a database, a flat file, live measurements from physical devices, scraped web data, or any of the myriad static and streaming data services which abound across the internet.
Here’s an example of a data source in action. Imagine a fashion brand selling products online. To display whether an item is out of stock, the website gets information from an inventory database. In this case, the inventory tables are a data source, accessed by the web application which serves the website to customers.
Focusing on how the term is used in the familiar database management context will help to clarify what kinds of data sources exist, how they work, and when they are useful.
Data source nomenclature
Databases remain the most common data sources, as the primary stores for data in ubiquitous relational database management systems (RDBMS). In this context, an important concept is the Data Source Name (DSN). The DSN is defined within destination databases or applications as a pointer to the actual data, whether it exists locally or on a remote server (and whether in a single physical location or virtualized). The DSN is not necessarily the same as the relevant database name or file name; rather, it is an address or label used to easily reach the data at its source.
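The lookup a DSN performs can be sketched in plain Python. The registry and its entries below are hypothetical; in practice, DSNs are defined in the operating system's ODBC administrator or in application configuration.

```python
# Hypothetical registry mapping a Data Source Name (DSN) to the actual
# connection details it stands for. Note that "SalesDSN" is only a
# label; the underlying database carries a different name entirely.
DSN_REGISTRY = {
    "SalesDSN": {
        "driver": "PostgreSQL Unicode",
        "server": "db.internal.example.com",
        "port": 5432,
        "database": "sales_prod_v2",  # real database name differs from the DSN
    },
}

def resolve_dsn(name: str) -> dict:
    """Return the connection details a DSN points to."""
    try:
        return DSN_REGISTRY[name]
    except KeyError:
        raise LookupError(f"No data source registered under {name!r}")

details = resolve_dsn("SalesDSN")
print(details["database"])  # the DSN hides this name from everyday use
```

Because consumers refer only to the label, the underlying server or database can be moved or renamed without changing the applications that use it.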
Ultimately, the systems doing the ingesting (of data) determine the context for any discussion around data sources, so definitions and nomenclature vary widely and may be confusing. This is especially true in more technical documentation. For example, within the Java software platform, a ‘Datasource’ refers specifically to an object representing a connection to a database (like an extensible, programmatically packaged DSN). Meanwhile, some newer platforms use ‘DataSource’ more widely to mean any collection of data which provides a standardized means for access.
Data source types
Though the diversity of content, format, and location for data is only increasing with contributions from technologies such as IoT and the adoption of big data methodologies, it remains possible to classify most data sources into two broad categories: machine data sources and file data sources.
Though both share the same basic purpose — pointing to the data’s location and describing similar connection characteristics — machine and file data sources are stored, accessed, and used in different ways.
Machine data sources
Machine data sources have names defined by users, must reside on the machine that is ingesting data, and cannot be easily shared. Like other data sources, machine data sources provide all the information necessary to connect to data, such as relevant software drivers and a driver manager, but users need only ever refer to the DSN as shorthand to invoke the connection or query the data.
The connection information is stored in environment variables, database configuration options, or a location internal to the machine or application being used. An Oracle data source, for example, will contain a server location for accessing the remote DBMS, information about which drivers to use, the driver engine, and any other relevant parts of a typical connection string, such as system and user IDs and authentication.
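As a sketch of the first storage option, connection details can be read from environment variables and assembled into a connection string. The variable names and the Oracle-style "easy connect" format below are illustrative assumptions, not a fixed standard.

```python
import os

def build_connect_string(env=os.environ) -> str:
    """Assemble an Oracle-style easy-connect string (host:port/service)
    from configuration variables. The variable names are hypothetical."""
    host = env["DB_HOST"]
    port = env["DB_PORT"]
    service = env["DB_SERVICE"]
    return f"{host}:{port}/{service}"

# Example configuration, standing in for real environment variables.
config = {
    "DB_HOST": "oracle.example.com",
    "DB_PORT": "1521",
    "DB_SERVICE": "ORCLPDB1",
}
print(build_connect_string(config))  # oracle.example.com:1521/ORCLPDB1
```

Credentials would typically come from the same configuration layer or a secrets manager, never hard-coded alongside the address.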
File data sources
File data sources contain all of the connection information inside a single, shareable, computer file (typically with a .dsn extension). Users do not decide which name is assigned to file data sources, as these sources are not registered to individual applications, systems, or users, and in fact do not have a DSN like that of machine data sources. Each file stores a connection string for a single data source.
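A file data source is just an INI-style text file, which any application able to read the file can reuse. The sketch below writes and parses a minimal example; the `[ODBC]` section name follows the ODBC file DSN convention, and the driver and server values are assumptions for illustration.

```python
import configparser
import os
import tempfile

# Minimal example of a shareable .dsn file: a single connection
# string's keywords, stored as plain text under an [ODBC] section.
dsn_text = """\
[ODBC]
DRIVER = PostgreSQL Unicode
SERVER = db.example.com
PORT = 5432
DATABASE = inventory
"""

path = os.path.join(tempfile.mkdtemp(), "inventory.dsn")
with open(path, "w") as f:
    f.write(dsn_text)

# Any application with access to the file can recover the connection info.
parser = configparser.ConfigParser()
parser.read(path)
print(parser["ODBC"]["SERVER"])  # db.example.com
```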
File data sources, unlike machine sources, are editable and copyable like any other computer file. This allows users and systems to share a common connection (by moving the data source between individual machines or servers) and to streamline data connection processes (for example, by keeping a source file on a shared resource so that multiple applications and users can use it simultaneously).
It is important to note that ‘unshareable’ .dsn files also exist. These are the same type of file as described above, but they exist on a single machine and cannot be moved or copied. These files point directly to machine data sources. This means that unshareable file data sources are wrappers for machine data sources, serving as a proxy for applications which expect only files but also need to connect to machine data.
How data sources work
Data sources are used in a variety of ways. Data can be transported over diverse network protocols, such as the well-known File Transfer Protocol (FTP) and HyperText Transfer Protocol (HTTP), or any of the myriad Application Programming Interfaces (APIs) provided by websites, networked applications, and other services.
Many platforms use data sources with FTP addresses to specify the location of data to be imported. For example, in the Adobe Analytics platform, a file data source is uploaded to a server using an FTP client, then a service utilizes this source to move and process the relevant data automatically.
SFTP (the "S" stands for Secure, or SSH) is used when usernames and passwords need to be protected and content encrypted; alternatively, FTPS achieves the same goal by adding Transport Layer Security (TLS) to FTP.
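Before any transfer, a platform must take apart the address that names the data's location. A minimal sketch with Python's standard library follows; the URL is hypothetical, and a real upload would then hand these parts to an FTP client such as Python's `ftplib`.

```python
from urllib.parse import urlparse

# Hypothetical FTP location pointing at a file data source.
location = "ftp://datauser:s3cret@ftp.example.com/imports/inventory.dsn"

parts = urlparse(location)
print(parts.scheme)    # ftp
print(parts.hostname)  # ftp.example.com
print(parts.username)  # datauser
print(parts.path)      # /imports/inventory.dsn

# An "sftp" or "ftps" scheme here would signal that credentials and
# content must be encrypted in transit.
```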
Meanwhile, many and diverse APIs are now provided to manage data sources and how they are used in applications. APIs are used to programmatically link applications to data sources, and typically provide more customization and a more versatile collection of access methods. For example, Spark provides an API with abstract implementations for representing and connecting to data sources, from barebones but extensible classes for generic relational sources, to detailed implementations for hard-coded JDBC connections.
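The kind of abstraction such APIs provide can be sketched in plain Python: a generic interface that any concrete source implements, so consumers never depend on where the data actually lives. This is an illustrative pattern, not Spark's actual class hierarchy.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Generic interface: every source knows how to read its own data."""

    @abstractmethod
    def read(self) -> list[dict]:
        ...

class InMemorySource(DataSource):
    """Trivial concrete source, standing in for a JDBC or file source."""

    def __init__(self, rows: list[dict]):
        self._rows = rows

    def read(self) -> list[dict]:
        return list(self._rows)

def load(source: DataSource) -> list[dict]:
    # The consumer depends only on the interface, not the source type.
    return source.read()

rows = load(InMemorySource([{"sku": "A1", "in_stock": True}]))
print(rows)  # [{'sku': 'A1', 'in_stock': True}]
```

Swapping in a database-backed or file-backed implementation requires no change to the consuming code, which is the versatility such APIs aim for.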
Other protocols for moving data from sources to destinations, especially on the web, include NFS, SMB, SOAP, REST, and WebDAV. These protocols are often used within APIs (and some APIs themselves make use of other APIs internally), within fully featured data applications, or as standalone transfer processes. Each has characteristic features and security concerns which should be considered for any data transfer.
The purpose of a data source
Ultimately, data sources are intended to help users and applications connect to data and move it to where it needs to be. They gather relevant technical information in one place and hide it, so data consumers can focus on processing data and determining how best to utilize it.
The purpose here is to package connection information in a more easily understood and user-friendly format. This makes data sources critical for integrating disparate systems, as they save stakeholders from the need to deal with and troubleshoot complex, low-level connection information.
And although this connection information is hidden, it is always accessible when necessary. Additionally, this information is stored in consistent locations and formats which can ease other processes such as migrations or planned system structural changes.
Getting started with data sources and integration
Once data has arrived at its final destination, preferably a centralized repository such as a cloud data warehouse, differences in formatting or structure based on the source should be smoothed out. The very first step towards this data integration goal, however, involves abstracting the initial data connections themselves — a complex task when accounting for the number of data sources accessible via the cloud.
Talend helps customers integrate data from thousands of internal and cloud-based sources, speeding up the journey from unmanageable, disparate systems, to a unified view of trusted enterprise data. Using a single suite of apps focused on data integrity and data integration, Talend Data Fabric improves and secures your data value-chain, from the very initial connection to a data source to effective analytics and business intelligence.
Try Talend Data Fabric today to seamlessly integrate your data sources and gain insights from data you can trust.