Text analytics reveal thirty-two percent of comments on Hive are not unique and at least ten percent add no value to discussion

holoz0r (75) in Hive Statistics • 19 days ago

In a past professional life, I once did a task that examined the quality and occurrence of text. No one asked for it. I was a Business Analyst who kept seeing the same comments come up, and I was concerned that poor quality notes were being left on customer accounts.

I was trying to ascertain how good the complaint resolution notes left on customer cases were, based on their length, uniqueness, and frequency. Now that I find myself temporarily unemployed, I thought it would be fun (if you can call data fun - I do) to create a study on HIVE comments, and to do some objective analysis of the comments left on HIVE.

Because I am feeling lazy for this analysis, I am using Power Query and Excel, so I'll include the step-by-step methodology as I go.

Firstly, some parameters about the data used:

The data is extracted from HiveSQL, and I am looking at the Comments table.

= DBHive{[Schema="dbo",Item="Comments"]}[Data]

I am then looking only for a week's worth of content:

= Table.SelectRows(dbo_Comments, each [created] >= #datetime(2025, 5, 18, 0, 0, 0) and [created] <= #datetime(2025, 5, 24, 0, 0, 0))

I am interested only in comments, not top level posts. Therefore I am filtering OUT content that does not have a parent author. I'm also keeping everything with a "blank" title, as this appears to get me actual comments.

= Table.SelectRows(#"Filtered Rows", each ([parent_author] <> null and [parent_author] <> "") and ([title] = ""))
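For readers who do not use Power Query, the two filters above can be sketched in plain Python. This is only an illustration of the same logic, not the query that produced the numbers in this post: the sample rows and the function name `is_comment_in_window` are made up, and only the column names (`created`, `parent_author`, `title`) come from the steps above.

```python
from datetime import datetime

# Hypothetical sample rows; only the column names mirror the Comments table filters above.
rows = [
    {"author": "alice", "parent_author": "bob", "title": "",
     "created": datetime(2025, 5, 19, 10, 0), "body": "Thanks!"},
    {"author": "bob", "parent_author": "", "title": "My post",
     "created": datetime(2025, 5, 20, 9, 0), "body": "A top-level post"},
    {"author": "carol", "parent_author": "bob", "title": "",
     "created": datetime(2025, 5, 25, 8, 0), "body": "Outside the window"},
    {"author": "dave", "parent_author": None, "title": "",
     "created": datetime(2025, 5, 21, 8, 0), "body": "No parent author"},
]

# Same bounds as the Power Query step: from the start of 18 May 2025
# up to and including the first instant of 24 May 2025.
START = datetime(2025, 5, 18, 0, 0, 0)
END = datetime(2025, 5, 24, 0, 0, 0)


def is_comment_in_window(row):
    """Keep rows inside the date window that have a parent author and a blank title."""
    in_window = START <= row["created"] <= END
    has_parent = row["parent_author"] not in (None, "")
    blank_title = row["title"] == ""
    return in_window and has_parent and blank_title


comments = [row for row in rows if is_comment_in_window(row)]
print([row["author"] for row in comments])
```

Only the first sample row survives: the second is a top-level post, the third falls outside the window, and the fourth has no parent author.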

This leaves me with 98,655 comments to work with as a sample set, covering a period of a week. The first thing I want to do is check the integrity of the data, and given that I know my own data best, let me see what I've been doing and who I've been talking to most on the blockchain in the last week:

holoz0r replies to user	this many times
riverflows	16
galenkp	8
cryptoandcoffee	4
jorgebgt	3
creativemary	3
abh12345	3
meno	3
hivewatchers	3
azircon	3
fastchrisuk	2
mattclarke	2
acidyo	2
steevc	2
menati	2
beatminister	2
raceline	2
buggedout	2
vatman	2
edicted	2
vimukthi	2

Looks about right, given that I know my activity.
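A sanity check like this one boils down to filtering on one author and counting the parent authors. Here is a rough Python sketch of that idea, assuming the comments have already been reduced to (author, parent_author) pairs; the pairs below are invented for illustration and `reply_targets` is my own helper name, not something from HiveSQL.

```python
from collections import Counter

# Hypothetical (author, parent_author) pairs standing in for the filtered comment rows.
pairs = [
    ("holoz0r", "riverflows"),
    ("holoz0r", "galenkp"),
    ("holoz0r", "riverflows"),
    ("galenkp", "holoz0r"),
    ("holoz0r", "riverflows"),
    ("holoz0r", "galenkp"),
    ("holoz0r", "meno"),
]


def reply_targets(pairs, author):
    """Count how often `author` replied to each parent author, most frequent first."""
    counts = Counter(parent for who, parent in pairs if who == author)
    return counts.most_common()


targets = reply_targets(pairs, "holoz0r")
for parent, n in targets:
    print(parent, n)
```

If the counts for an account you know well look plausible, that is a reasonable sign the filters are keeping the right rows.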

So my next step is to figure out which accounts made the most replies in the sampled period (because, as we all should know by now, not every account is a human, and that is pretty obvious from some of the account names that appear in the list).

The next thing I want to learn about is users who are not me, because they are typically more interesting than I am. The thing I love about data is that it hides absolutely nothing, and we can see that there are a lot of bots and token accounts...

User making comment	count of comments
hivebuzz	3634
lolzbot	2000
actifit	988
worldmappin	940
luvshares	822
beerlover	700
splinterboost	621
pizzabot	616
ladytoken	596
bpcvoter1	452
roswelborges	448
aquarius.academy	448
chi4god	442
hug.bot	435
hivebits	418
u89gw	415
xcv47	413
w7ngc	412
jkl65	411
w95hj	409
sor31	409
hk14d	407
fgh87	407
asd09	407
f76wz	405
vmn31	404
dw38h	404
wiv01	403
x6oc5	402
zxc43	401
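To put those counts in proportion, they can be expressed as a share of the 98,655-comment sample. The small Python sketch below does that arithmetic for the first five rows of the table only; the variable names are mine, and the figures are copied from the table and the sample size quoted earlier.

```python
# Sample size and the first five rows of the table above.
SAMPLE_SIZE = 98_655
top_five = {
    "hivebuzz": 3634,
    "lolzbot": 2000,
    "actifit": 988,
    "worldmappin": 940,
    "luvshares": 822,
}

# Share of the week's comments made by each account, in percent.
shares = {name: 100 * count / SAMPLE_SIZE for name, count in top_five.items()}

# Combined share of these five accounts, in percent.
combined = 100 * sum(top_five.values()) / SAMPLE_SIZE

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
print(f"top five combined: {combined:.2f}%")
```

The same calculation extends to the rest of the table by adding more rows to `top_five`.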

What I am interested in next is probably a futile exercise, but I want to know what the most commonly left ... comment is and what percentage that IDENTICAL comment makes up of all the comments left during the week.

I am pleased to report that this simple analysis reveals that:

Over 10% of the comments left on HIVE are entirely meaningless

Data doesn't lie. Here are the top 100 most commonly left comments.

Furthermore, once I exclude non-duplicate comments, we find that 32,068 of the comments left on HIVE for the week are non-unique. Therefore, from our original sample of 98,655 comments, a whopping 32.5% of comments left on the HIVE blockchain are NOT UNIQUE!
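One way to reproduce this kind of count is sketched below in Python. It assumes "non-unique" means a comment whose body text appears more than once in the sample, and it compares bodies exactly (no trimming or case folding); the post does not say whether any normalisation was applied, so treat both as assumptions. The sample bodies are invented.

```python
from collections import Counter

# Hypothetical comment bodies standing in for the body column of the sample.
bodies = [
    "Thanks!",
    "Thanks!",
    "Great post",
    "!PIZZA",
    "!PIZZA",
    "!PIZZA",
    "A genuinely unique thought",
]

counts = Counter(bodies)

# A comment is treated as non-unique if its exact body occurs more than once.
non_unique = sum(n for n in counts.values() if n > 1)
non_unique_share = non_unique / len(bodies)

# The single most common body and its share of all comments.
top_body, top_count = counts.most_common(1)[0]
top_share = top_count / len(bodies)

print(non_unique, round(100 * non_unique_share, 1))
print(top_body, round(100 * top_share, 1))

# The headline figure from the post, recomputed from its own numbers.
headline_pct = 100 * 32_068 / 98_655
print(round(headline_pct, 1))
```

In the toy data, five of the seven comments are non-unique. The last line recomputes the post's own percentage from 32,068 and 98,655.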

This means that, in aggregate, about one in three comments you see on HIVE will be identical to at least one other comment. Context is important though, so we've got to consider the common phrases that appear at the top of the list:

When I look through the duplicate comments, I can see that we're a grateful bunch, with the string "thank" appearing in 12,861 comments, or 13% of replies.
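A substring count like this is simple to sketch in Python. The version below matches case-insensitively, which is an assumption on my part (the post does not say whether "Thank" and "THANK" were counted), and `share_containing` is a hypothetical helper name; the sample bodies are invented.

```python
def share_containing(bodies, needle):
    """Return (hits, share) for bodies containing `needle`, ignoring case."""
    needle = needle.lower()
    hits = sum(1 for body in bodies if needle in body.lower())
    share = hits / len(bodies) if bodies else 0.0
    return hits, share


# Hypothetical comment bodies.
sample = ["Thank you!", "thanks a lot", "Nice photo", "THANKS", "ok"]
hits, share = share_containing(sample, "thank")
print(hits, round(100 * share, 1))

# The figure quoted in the post, recomputed from its own numbers.
thank_pct = 100 * 12_861 / 98_655
print(round(thank_pct, 1))
```

Note that a plain substring match also counts words such as "thankful" or "thankless", so it is a rough measure of gratitude rather than an exact one.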

I plan on interrogating this data in more depth, but I think this is a good starting point to build a future "dashboard" of comment health on HIVE.

What would you like to see in such a dashboard?

My thoughts are as follows: