Oracle

Approximate Count Distinct in Oracle Database 12c


Approximate Count Distinct in Oracle Database 12c


By Yuli Vasiliev

Counting distinct values in a dataset is a commonly used operation in data analysis. In an Oracle Database, you can do this using function COUNT with the DISTINCT clause. Starting with Oracle Database 12cR1 (12.1.0.2), you can also use function APPROX_COUNT_DISTINCT, which – unlike COUNT with DISTINCT returning the exact result – returns the approximate number of rows containing distinct values of the argument expression.

APPROX_COUNT_DISTINCT is designed to operate on large amounts of data, processing them much faster than COUNT. The article provides some examples of using APPROX_COUNT_DISTINCT, comparing the results with those generated with the COUNT DISTINCT couple.

In one of my preceding articles Loading JSON Through External Tables, which can be found at https://sswug.org/yulivasiliev/oracle/loading-json-through-external-tables/, I discussed how to load external JSON data to an Oracle database through an external table, using, as a JSON data source, the PurchaseOrders.dmp file that comes with Oracle Database 12c as a sample and contains 10000 PO documents in JSON format. The queries provided below assume you have created the database objects as described in that article. In particular, you will need to have the ext_json external table created in your Oracle database.

First, let’s simply count the rows in the ext_json table. This can be done, for example, as follows:

SELECT COUNT (a.JSON_DOC.Reference) from ext_json a;

Expectedly, producing the following result:

10000

Let’s now try to exclude the rows with repeating references. With the DISTINCT clause of COUNT, this can be done as follows:

SELECT COUNT (DISTINCT a.JSON_DOC.Reference) from ext_json a;

This time you should see something like this:

8861

And what the APPROX_COUNT_DISTINCT function generates when it takes the same argument:

SELECT APPROX_COUNT_DISTINCT(a.JSON_DOC.Reference) from ext_json a

As you can see, the numberof rows shown by APPROX_COUNT_DISTINCT is a bit different:

9274

Complicating the query, let’s try to count the number of line items included in all the orders. This can be done with the following query:

SELECT count(t.Description)

FROM ext_json j,

JSON_TABLE(j.json_doc, ‘$’

COLUMNS (Reference VARCHAR2(50 CHAR) PATH ‘$.Reference’,

NESTED PATH ‘$.LineItems[*]’

COLUMNS (

ItemNumber NUMBER(3) PATH ‘$.ItemNumber’,

Description VARCHAR2(250 CHAR) PATH ‘$.Part.Description’

))) t;

The result should look like this:

45260

However, when used with DISTINCT:

SELECT count(DISTINCT t.Description)

FROM ext_json j,

JSON_TABLE(j.json_doc, ‘$’

COLUMNS (Reference VARCHAR2(50 CHAR) PATH ‘$.Reference’,

NESTED PATH ‘$.LineItems[*]’

COLUMNS (

ItemNumber NUMBER(3) PATH ‘$.ItemNumber’,

Description VARCHAR2(250 CHAR) PATH ‘$.Part.Description’

))) t;

COUNT should give you something like this:

5166

Finally, let’s look at what APPROX_COUNT_DISTINCT will generate when applied to the same query:

SELECT APPROX_COUNT_DISTINCT(t.Description)

FROM ext_json j,

JSON_TABLE(j.json_doc, ‘$’

COLUMNS (Reference VARCHAR2(50 CHAR) PATH ‘$.Reference’,

NESTED PATH ‘$.LineItems[*]’

COLUMNS (

ItemNumber NUMBER(3) PATH ‘$.ItemNumber’,

Description VARCHAR2(250 CHAR) PATH ‘$.Part.Description’

))) t;

As you can see, the result number is very similar:

5163

Conclusion

As you learned in this article, new Oracle Database 12CR1 function APPROX_COUNT_DISTINCT can be used as an alternative to COUNT with DISTINCT, producing similar results. Oracle recommends using APPROX_COUNT_DISTINCT when it comes to counting large amounts of data.

About the Author

Yuli Vasiliev is a software developer, freelance author, and consultant currently specializing in open source development, Java technologies, business intelligence (BI), databases, and service-oriented architecture (SOA). He is the author of Oracle Business Intelligence: The Condensed Guide to Analysis and Reporting (Packt, 2010) and PHP Oracle Web Development (Packt, 2007) as well as a series of other books on the Oracle technology.