Category: Technology

July 17, 2023

Counting Messages in All Kafka Topics

For some reason, I keep being handed Kafka instances that no one can identify or explain. The first step, generally, is figuring out whether an instance does anything at all. A server that no one has sent a message to in a year or two … well, there’s not much point in bringing it up to standard, monitoring it, and such. My first-glance analysis has been counting the messages in every topic to see which topics are actually used. A quick bash script accomplishes this (presuming a Kafka broker is on port 9092 of the host running the script, and that the script is run from the folder holding the Kafka shell scripts):

strTopics=$(./kafka-topics.sh --list --bootstrap-server $(hostname):9092)

SAVEIFS=$IFS   
IFS=$'\n'      
arrayTopics=($strTopics)
IFS=$SAVEIFS   

for i in "${arrayTopics[@]}"; do
    # Quote the topic name and capture only the consumer's summary line
    iMessages=$(./kafka-console-consumer.sh --bootstrap-server $(hostname):9092 --topic "$i" --property print.timestamp=true --from-beginning --timeout-ms=10000 2>&1 | grep "Processed a total of")
    echo "$i     $iMessages"
done
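
The loop prints one line per topic, with the consumer's summary after the topic name. A minimal sketch for trimming that down to just the topics that hold messages, assuming the output keeps the "topic     Processed a total of N messages" shape; the function name active_topics is mine, not part of Kafka:

```shell
# Hypothetical helper: reads the loop's output on stdin and prints
# "topic count" for each topic with at least one message.
# Assumes the count is the second-to-last word of each summary line.
active_topics() {
    awk '/Processed a total of/ && $(NF - 1) + 0 > 0 { print $1, $(NF - 1) }'
}
```

Piping the loop's output through active_topics leaves only the topics worth a closer look. Topics where the consumer printed no summary line are dropped too, so a short list is not proof that everything else is empty.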

July 12, 2023

NEO4J: Exploring the Data

Since I seem to frequently acquire orphaned platforms with no documentation, I figured it would be good to work out how to investigate an unknown Neo4j platform to see what it’s got. “SHOW” is very useful in these cases. The full list of keywords that can follow SHOW (some only as part of a longer command, like SHOW BUILT IN FUNCTIONS) is:

"ALIAS"
"ALIASES"
"ALL"
"BTREE"
"BUILT"
"CONSTRAINT"
"CONSTRAINTS"
"CURRENT"
"DATABASE"
"DATABASES"
"DEFAULT"
"EXIST"
"EXISTENCE"
"EXISTS"
"FULLTEXT"
"FUNCTION"
"FUNCTIONS"
"HOME"
"INDEX"
"INDEXES"
"KEY"
"LOOKUP"
"NODE"
"POINT"
"POPULATED"
"PRIVILEGE"
"PRIVILEGES"
"PROCEDURE"
"PROCEDURES"
"PROPERTY"
"RANGE"
"REL"
"RELATIONSHIP"
"ROLE"
"ROLES"
"SERVER"
"SERVERS"
"SETTING"
"SETTINGS"
"SUPPORTED"
"TEXT"
"TRANSACTION"
"TRANSACTIONS"
"UNIQUE"
"UNIQUENESS"
"USER"
"USERS"

The most useful ones for figuring out what you’ve got … the “show databases” command I know from MySQL/MariaDB does what I expected. You can also include a specific database name, but “show database ljrtest” doesn’t appear to list any more information than the generic show databases command.

There’s also a “show users” command that outputs the users in the database – although I’m using the community edition without authentication, so there isn’t much interesting information being output here.

And roles, if roles are being used, should be output with “SHOW ALL ROLES” … but mine just says “Unsupported administration command”

Once you know what databases you’ve got and who can log in and do stuff, you’ll want to look at the data. There are some built-in procedures that help out here. “CALL db.labels()” will list the labels in the selected database.

You can also return a distinct list of labels along with the count of nodes with each label.

Since a node can have multiple labels, comparing the total of those counts with a count of all nodes gives you an idea of how many nodes carry more than one label. In my case, either view shows 156 … so I know there are few (if any) nodes with multiple labels.

To view the types of relationships defined in the selected database, use “CALL db.relationshipTypes()”

Similarly, you can return the relationship types along with counts

There is a procedure to list the property keys used within the data (“CALL db.propertyKeys()”). Interesting to note: keys that were used, but whose nodes were subsequently deleted, still show up as property keys. Basically, anything that is there is on this list, but some things on this list may not be there anymore. In this example, ‘parentm’ and ‘parentf’ were property keys I used to build relationships programmatically.

I’ve found db.schema.nodeTypeProperties to be more useful in this regard – it does not appear to list properties that are not in use, and the output includes field types

To see if there are any custom procedures or functions registered, look on the server(s). Use ps -efww to view the running command – there will be some folders listed after “-cp” … you could find procedures or plugins in any of those folders. In my case, the plugins are in /plugins

And the only “custom” things registered are the APOC and APOC-Extended jars.
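
Picking the folders out of a long ps line by eye gets tedious. A rough sketch that splits the classpath into one entry per line, assuming a space-separated command line with "-cp" (or "-classpath") followed by a colon-separated list; classpath_entries is a made-up helper name and the paths in the example line are placeholders:

```shell
# Hypothetical helper: print each classpath entry from a Java command line.
# $1 is the full command line as a single string (e.g. copied from ps -efww).
classpath_entries() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "-cp" || $i == "-classpath") {
                n = split($(i + 1), parts, ":")
                for (j = 1; j <= n; j++) print parts[j]
            }
    }'
}

# Example with a placeholder command line:
# classpath_entries "java -cp /plugins/*:/conf/*:/lib/* com.example.Main"
```

Each printed entry is a folder (or jar) worth checking for procedures and plugins.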

July 11, 2023

Tableau PostgreSQL Query: Finding All Datasources of Name or Type

I frequently need to find details on a data source based on its name, or to find all data sources of a particular type. In this case, the Microsoft Graph permissions required to use SharePoint and OneDrive data within Tableau changed, and I needed to reach out to the individuals who use those data types to build a business case for the Security organization to approve adding the new permissions to our tenant.

-- Query to find all data sources of a specific type or name 
select system_users.email, datasources.id, datasources.name, datasources.created_at, datasources.updated_at, datasources.db_class, datasources.db_name
, datasources.site_id, sites.name as SiteName, projects.name as ProjectName, workbooks.name as WorkbookName
from datasources
left outer join users on users.id = datasources.owner_id
left outer join system_users on users.system_user_id = system_users.id
left outer join sites on datasources.site_id = sites.id
left outer join projects on datasources.project_id = projects.id
left outer join workbooks on datasources.parent_workbook_id = workbooks.id
-- where datasources.name like '%Sheet1 (LJR Sample%'
where datasources.db_class = 'onedrive'
order by datasources.name
;

July 10, 2023

Neo4J – The Importance of the Data Model

While I am certain table-based SQL databases required planning to establish a reasonable data model – optimizing storage, defining foreign keys, indexing … I have found it more challenging to create a good data model in Neo4j. Maybe that’s because I normally populate SQL tables with custom scripts that can be modified to handle all sorts of edge cases. Maybe I’m still thinking in tables, but there seems to be more trial and error in defining the data model than I’ve ever had in SQL databases.

In the import-from-html-table example, a candidate often is associated with multiple elections. Storing candidates as nodes and elections as other nodes that contain results (electoral college votes for winner & loser and popular votes for winner & loser) then associating candidates with elections allowed me to store data about US elections in the Graph. I know who ran, who won and who lost, and what the results were for each election.

Associating the results with candidates didn’t work because Franklin D Roosevelt only has one property for “EC_VOTES” … which election does that reflect? I could also have added the vote totals to the relationships, but that would either separate the data (the loser’s votes stored on LOST relationships and the winner’s on WON relationships) or duplicate it (both WON and LOST relationships containing the same vote numbers).

Query used to populate the data:

CALL apoc.load.html("https://www.iweblists.com/us/government/PresidentialElectionResults.html",
{electionyear: "#listtable tbody tr td:eq(0)"
, winner: "#listtable tbody tr td:eq(1)"
, loser: "#listtable tbody tr td:eq(2)"
, electoral_win: "#listtable tbody tr td:eq(3)"
, electoral_lose: "#listtable tbody tr td:eq(4)"
, popular_win: "#listtable tbody tr td:eq(5)"
, popular_delta: "#listtable tbody tr td:eq(6)" }) yield value
WITH value, size(value.electionyear) as rangeup

// range() includes both ends, so stop at rangeup-1 to avoid an extra all-null ("Unknown") row
UNWIND range(0,rangeup-1) as i WITH value.electionyear[i].text as ElectionYear
, value.winner[i].text as Winner, value.loser[i].text as Loser
, value.electoral_win[i].text as EC_Winner, value.electoral_lose[i].text as EC_Loser
, value.popular_win[i].text as Pop_Vote_Winner
, value.popular_delta[i].text as Pop_Vote_Margin

MERGE (election:Election {year: coalesce(ElectionYear,"Unknown")})
SET election.EC_Votes_Winner = coalesce(EC_Winner,"Unknown")
SET election.EC_Votes_Loser = coalesce(EC_Loser,"Unknown")
SET election.Pop_Votes_Winner = apoc.text.replace(Pop_Vote_Winner, ",", "")
SET election.Pop_Votes_Loser = apoc.number.exact.sub(apoc.text.replace(Pop_Vote_Winner, ",", ""), apoc.text.replace(Pop_Vote_Margin, ",", ""))

MERGE (ew:CANDIDATE {name: coalesce(Winner,"Unknown")})
MERGE (el:CANDIDATE {name: coalesce(Loser,"Unknown")})

MERGE (ew)-[:WON]->(election) MERGE (el)-[:LOST]->(election);
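
The Pop_Votes_Loser line is the only real arithmetic in the query: strip the thousands separators, then subtract the margin from the winner's total. The same calculation as a shell sketch, handy for spot-checking a row; loser_votes is a hypothetical helper and the figures in the example line are made up:

```shell
# Hypothetical helper: loser's popular vote = winner's votes - margin.
# Both arguments may contain commas as thousands separators.
loser_votes() {
    winner=$(printf '%s' "$1" | tr -d ',')
    margin=$(printf '%s' "$2" | tr -d ',')
    echo $((winner - margin))
}

# Example with made-up figures:
# loser_votes "1,000,000" "250,000"
```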

July 7, 2023

NEO4J: More Cypher Queries

To get a count of returned records, Cypher uses COUNT pretty much the same way as SQL does

Interestingly, there are other aggregation functions that remind me of using the ELK API — I can get averages, min/max, and standard deviation.

Chaining MATCH statements functions similarly to a SQL JOIN: get the items with this label, add in some other stuff. And, just like an INNER JOIN, this means no data is returned when one of the conditions has no matches. Bill Clinton never lost an election, so matching on elections he lost returns an empty data set.

The equivalent of an outer join is an OPTIONAL MATCH — here, the records from the first MATCH will be returned even if there is no corresponding record matching the second MATCH

ORDER BY also works the same way it does in SQL. Multiple sort keys are separated by commas; add DESC after a key to sort it in descending order.

WHERE can be used to create the equivalent of a LIKE query. The =~ operator uses regular expression syntax, so you don’t just use % or * as a wildcard; regex wildcards like .* (match any character zero or more times) are used instead. The pattern has to match the whole value, so a “contains” search needs .* on both ends.

July 6, 2023

Tableau PostgreSQL Query: Stats About Data Source Types

I wanted to report on the different types of data sources used in our Tableau instance — as well as show how many of each type are in use.

-- Query to find how many data sources there are of each type
select datasources.db_class, count(datasources.db_class) as count
from datasources
left outer join users on users.id = datasources.owner_id
left outer join system_users on users.system_user_id = system_users.id
left outer join sites on datasources.site_id = sites.id
left outer join projects on datasources.project_id = projects.id
left outer join workbooks on datasources.parent_workbook_id = workbooks.id
group by datasources.db_class 
order by count desc ;

Answer: About half of them are Oracle!

July 5, 2023

NEO4J: Using APOC To Load HTML Table Data

I’ve been playing around with loading Neo4j data from random tables on the web using apoc.load.html from the extended APOC library. The first trick is knowing how to write jQuery-style selectors to find elements of a webpage: the table with the id “listtable”, then the path down to the data elements (tbody tr td) and the zero-based column numbers.

Once you have extracted the data, you can then manipulate it, map it into fields, create relationships, etc.

UNWIND is used as a “for each” loop that allows us to iterate through the result set.

MERGE matches an existing record or creates one if none exists (which, in this case, means I have a poor data model … someone could well have run in multiple elections, and I am not really accommodating those cases well. Since I don’t actually want a database of presidential elections but was really just testing some new-to-me functionality … we’re going to ignore these logic problems).

SET adds (or updates) properties of the node.

CALL apoc.load.html("https://www.iweblists.com/us/government/PresidentialElectionResults.html",
{electionyear: "#listtable tbody tr td:eq(0)"
	, winner: "#listtable tbody tr td:eq(1)"
	, loser: "#listtable tbody tr td:eq(2)"
	, electoral_win: "#listtable tbody tr td:eq(3)"
	, electoral_lose: "#listtable tbody tr td:eq(4)"
	, popular_win: "#listtable tbody tr td:eq(5)"
	, popular_delta: "#listtable tbody tr td:eq(6)" }) yield value 

WITH value, size(value.electionyear) as rangeup

// range() includes both ends, so stop at rangeup-1 to avoid an extra all-null ("Unknown") row
UNWIND range(0,rangeup-1) as i WITH value.electionyear[i].text as ElectionYear
	, value.winner[i].text as Winner
	, value.loser[i].text as Loser
	, value.electoral_win[i].text as EC_Winner
	, value.electoral_lose[i].text as EC_Loser
	, value.popular_win[i].text as Pop_Vote_Winner
	, value.popular_delta[i].text as Pop_Vote_Delta

MERGE (ew:Candidate {name: coalesce(Winner,"Unknown")}) 
MERGE (el:Record {name: coalesce(Loser,"Unknown")}) 
SET ew.EC_Votes = coalesce(EC_Winner,"Unknown") 
SET el.EC_Votes = coalesce(EC_Loser,"Unknown")
SET ew.Year = ElectionYear
SET el.Year = ElectionYear

WITH *, replace(Pop_Vote_Delta,",","") as Pop_Vote_Delta_Int, replace(Pop_Vote_Winner,",","") as Pop_Winner_Int

SET ew.Pop_Votes = Pop_Winner_Int
SET el.Pop_Votes = apoc.number.exact.sub(Pop_Winner_Int, Pop_Vote_Delta_Int)

MERGE (ew)-[:DEFEATED]->(el);

June 28, 2023

NEO4J: WITH and Scope

I have encountered another manual reading failure error — if you actually read the Cypher documentation for WITH, it clearly states that entering a WITH block creates a new scope into which previous variables are not imported. Unless you specifically include them. You can individually include variables in this new scope (WITH oldvariable1, oldvariable2, newSomething as newvariable1) or just use * to include all previous variables.

Doing neither will produce an error that a variable that you are absolutely positive exists does not, in fact, exist.

June 23, 2023

Unable to use JStat with Cassandra

We have been having some problems with a Cassandra cluster, so I wanted to look at the java heap space. Unfortunately, jstat cannot find the pid. And, yes, it is the right PID!

Looking in /tmp/hsperfdata_cassandra/, there’s no file! Reading through the whole line where Cassandra is running, I noticed +PerfDisableSharedMem … that’d do it!

It looks like they intentionally set +PerfDisableSharedMem in the Cassandra startup script. I assume their rationale is still reasonable … so I wouldn’t remove the parameter for day-to-day operation. But when there’s a problem … restarting Cassandra without this parameter allows us to check how garbage collection is going.
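
Before blaming jstat, it is quick to confirm whether the flag is set on the running process. A sketch assuming Linux, where /proc/<pid>/cmdline holds the NUL-separated command line; has_perf_disabled is a made-up name and 12345 is a placeholder PID:

```shell
# Hypothetical helper: succeeds if the given command line contains the flag
# that stops the JVM from writing its hsperfdata file.
has_perf_disabled() {
    case "$1" in
        *-XX:+PerfDisableSharedMem*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on Linux (replace 12345 with the Cassandra PID):
# has_perf_disabled "$(tr '\0' ' ' < /proc/12345/cmdline)" && echo "jstat will not see this JVM"
```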

June 23, 2023

Java Heap Stats with JStat

While there are plenty of third-party utilities for looking at the Java heap space, I just use jstat (in OpenJDK, this means installing java-<Version>-openjdk-devel).

JStat will display the following columns:

--------------------------------------------------------------------------------
S0C: Survivor space 0 size in K
S1C: Survivor space 1 size in K

S0U: Survivor space 0 usage in K
S1U: Survivor space 1 usage in K

--------------------------------------------------------------------------------

EC: Eden space size in K
EU: Eden space usage in K

--------------------------------------------------------------------------------

OC: Old space size in K
OU: Old space usage in K

--------------------------------------------------------------------------------

MC: Meta space size in K
MU: Meta space usage in K

--------------------------------------------------------------------------------

CCSC: Compressed class space size in K
CCSU: Compressed class space usage in K

--------------------------------------------------------------------------------

YGC: Young generation garbage collection count
YGCT: Young generation garbage collection total time in seconds

FGC: Full garbage collection count
FGCT: Full garbage collection total time in seconds

CGC: Concurrent garbage collection count
CGCT: Concurrent garbage collection time in seconds
GCT: Total garbage collection time in seconds

--------------------------------------------------------------------------------

https://stackoverflow.com/questions/13660871/jvm-garbage-collection-in-young-generation/13661014#13661014 does a good job of explaining the nomenclature & how stuff gets moved around in the heap space

Sample output — this command is for java PID 19356 and will list 100 lines 2 seconds apart (2000 ms)

server01:bin # jstat -gc 19356 2000 100
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
68096.0 68096.0 0.0 64207.5 545344.0 319007.2 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 386674.5 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 457055.4 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 485538.8 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 505893.4 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815

This is where a third-party tool would be helpful, but I never really ‘get’ what is and what is not OK to install on servers, so I try not to install things. The *useful* bit of information in all of this is really the percent utilization: usage divided by size.
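
The percentages are easy enough to compute without installing anything. A minimal sketch using awk, assuming the jstat -gc layout shown above (a header row, then one row per sample); jstat_utilization is a hypothetical name, and the columns are looked up by header so extra columns such as CGC/CGCT do not shift anything:

```shell
# Hypothetical helper: reads "jstat -gc" output on stdin and prints
# Eden, Old, and Metaspace utilization as percentages for each sample.
jstat_utilization() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i
                   print "Eden% Old% Meta%"; next }
         NF      { printf "%.1f %.1f %.1f\n",
                       100 * $(col["EU"]) / $(col["EC"]),
                       100 * $(col["OU"]) / $(col["OC"]),
                       100 * $(col["MU"]) / $(col["MC"]) }'
}

# jstat -gc 19356 2000 100 | jstat_utilization
```

For the first sample row above, this works out to roughly 58.5% Eden, 62.5% Old, and 90.4% Metaspace.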

That last grouping of stuff, I compare against how long the PID has been running. If you’ve gotten a billion GCs and the PID has only been running for eight seconds, that is a crazy amount of memory churn. If I’ve only had 3 GCs and the PID has been running for seven years, it hasn’t been doing anything. In between? I don’t really find the numbers useful unless I’ve got a baseline from normal operation.
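
To put a number on that comparison, divide the collection count by the process uptime. A rough sketch; gc_per_hour is a made-up helper, and the example line assumes a procps-style ps that supports the etimes (elapsed seconds) column:

```shell
# Hypothetical helper: garbage collections per hour.
# $1 = total GC count (YGC + FGC), $2 = process uptime in seconds.
gc_per_hour() {
    awk -v gcs="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", gcs * 3600 / secs }'
}

# Example using the counts from the sample output (PID 19356):
# gc_per_hour $((324697 + 228)) "$(ps -o etimes= -p 19356)"
```

The rate only means something next to a baseline taken during normal operation, but it makes two snapshots easy to compare.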