Category: Technology

NEO4J: Using APOC To Load HTML Table Data

I’ve been playing around with loading neo4j data from random tables on the web using apoc.load.html from the extended APOC library. The first trick to it is knowing how to use jquery to find elements of a webpage — the table named “listtable” then the path down to the data elements (tbody tr td) and column numbers.

Once you have extracted the data, you can then manipulate it, map it into fields, create relationships, etc.

UNWIND is used as a “for each” loop that allows us to iterate through the result set.

MERGE creates or updates records (which, in this case, means I have a poor data model … someone could well have run in multiple elections and I am not really accommodating those cases well. Since I don’t actually want a database of presidential elections but was really just testing some new-to-me functionality … we’re going to ignore these logic problems)

SET adds (or updates) properties of the node.

CALL apoc.load.html("https://www.iweblists.com/us/government/PresidentialElectionResults.html",
{electionyear: "#listtable tbody tr td:eq(0)"
	, winner: "#listtable tbody tr td:eq(1)"
	, loser: "#listtable tbody tr td:eq(2)"
	, electoral_win: "#listtable tbody tr td:eq(3)"
	, electoral_lose: "#listtable tbody tr td:eq(4)"
	, popular_win: "#listtable tbody tr td:eq(5)"
	, popular_delta: "#listtable tbody tr td:eq(6)" }) yield value 

WITH value, size(value.electionyear) as rangeup

UNWIND range(0,rangeup) as i WITH value.electionyear[i].text as ElectionYear
	, value.winner[i].text as Winner
	, value.loser[i].text as Loser
	, value.electoral_win[i].text as EC_Winner
	, value.electoral_lose[i].text as EC_Loser
	, value.popular_win[i].text as Pop_Vote_Winner
	, value.popular_delta[i].text as Pop_Vote_Delta

MERGE (ew:Candidate {name: coalesce(Winner,"Unknown")}) 
MERGE (el:Record {name: coalesce(Loser,"Unknown")}) 
SET ew.EC_Votes = coalesce(EC_Winner,"Unknown") 
SET el.EC_Votes = coalesce(EC_Loser,"Unknown")
SET ew.Year = ElectionYear
SET el.Year = ElectionYear

WITH *, replace(Pop_Vote_Delta,",","") as Pop_Vote_Delta_Int, replace(Pop_Vote_Winner,",","") as Pop_Winner_Int

SET ew.Pop_Votes = Pop_Winner_Int
SET el.Pop_Votes = apoc.number.exact.sub(Pop_Winner_Int, Pop_Vote_Delta_Int)

MERGE (ew)-[:DEFEATED]->(el);

NEO4J: WITH and Scope

I have encountered another manual reading failure error — if you actually read the Cypher documentation for WITH, it clearly states that entering a WITH block creates a new scope into which previous variables are not imported. Unless you specifically include them. You can individually include variables in this new scope (WITH oldvariable1, oldvariable2, newSomething as newvariable1) or just use * to include all previous variables.

Doing neither will produce an error that a variable that you are absolutely positive exists does not, in fact, exist.

Unable to use JStat with Cassandra

We have been having some problems with a Cassandra cluster, so I wanted to look at the java heap space. Unfortunately, jstat cannot find the pid. And, yes, it is the right PID!

Looking in /tmp/hsperfdata_cassandra/, there’s no file! Reading through the whole line where Cassandra is running, I noticed +PerfDisableSharedMem … that’d do it!

It looks like they intentionally set +PerfDisableSharedMem in the Cassandra startup script. I assume their rational is still reasonable … so wouldn’t remove the parameter for day-to-day operation. But, when there’s a problem … restarting Cassandra without this parameter allows us to check how garbage collection is going.

 

Java Heap Stats with JStat

While there are plenty of third-party utilities for looking at the java heap space, I just use jstat (in OpenJDK, this means installing java-<Version>-openjdk-devel

JStat will display the following columns:

--------------------------------------------------------------------------------
S0C: Survivor space 0 size in K
S1C: Survivor space 1 size in K

S0U: Survivor space 0 usage in K
S1U: Survivor space 1 usage in K

--------------------------------------------------------------------------------

EC: Eden space size in K
EU: Eden space usage in K

--------------------------------------------------------------------------------

OC: Old space size in K
OU: Old space usage in K

--------------------------------------------------------------------------------

MC: Meta space size in K
MU: Meta space usage in K

--------------------------------------------------------------------------------

CCSC: CodeCache size in K
CCSU: CodeCache usage in K

--------------------------------------------------------------------------------

YGC: Young generation garbage collection count
YGCT: Young generation garbage collection total time in seconds

FGC: Full garbage collection count
FGCT: Full garbage collection total time in seconds

CGC: Concurrent garbage collection count
CGCT: Concurrent garbage collection time in seconds
GCT: Total garbage collection time in seconds

--------------------------------------------------------------------------------

https://stackoverflow.com/questions/13660871/jvm-garbage-collection-in-young-generation/13661014#13661014 does a good job of explaining the nomenclature & how stuff gets moved around in the heap space

Sample output — this command is for java PID 19356 and will list 100 lines 2 seconds apart (2000 ms)

server01:bin # jstat -gc 19356 2000 100
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
68096.0 68096.0 0.0 64207.5 545344.0 319007.2 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 386674.5 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 457055.4 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 485538.8 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815
68096.0 68096.0 0.0 64207.5 545344.0 505893.4 30775744.0 19221750.2 137452.0 124322.4 18860.0 15380.6 324697 14589.985 228 45.830 14635.815

And this is a time where a third-party tool would be helpful but I never really ‘get’ what is and what is not OK to install on servers, so try not to install things — because the *useful* bit of information for any of this is really the usage / size percent utilization value.

That last grouping of stuff — I look at those v/s how long the pid has been running. If you’ve gotten a billion GC’s and the PID has only been running for eight seconds, that is a crazy amount of I/O. If I’ve only had 3 GCs and the pid has been running for seven years, it hasn’t been doing anything. In between? I don’t really find the numbers useful unless I’ve got a baseline from normal operation.

NEO4J: Union Sets

There are a lot of places where cypher queries don’t seem to have an “OR” type of functionality … but you can union two returned sets provided they have the same properties list
MATCH z=(:TOMATO {name: ‘Black Krim’})
RETURN z
UNION
MATCH z=(:TOMATO {name: ‘Cherokee Purple’})
RETURN z;

NEO4J: Debugging a Cypher Query

It is somewhat ironic that I continue to use print statements as my debugging tool of choice when programming but spent a decent bit of time trying to find a cypher query debugger. Just use a print statement — or, in this case, return.

When my query returned an error indicating that the variable isn’t defined even though I copy/pasted the variable name from whence I defined it:

I could just omit the component of the query with the error and try returning this variable

And performing operations on null values may not get me anywhere. Adding a replacement command to drop the commas produces integer values:

Using Git Submodules

Contents

Using Git Submodules

What Are Submodules?

Before You Start

Recursion

Adding Submodules

Performing Operations on All Submodules

Cloning a Repository That Contains Submodules

Making Changes to a Submodule

Pulling Changes Made to Submodules

Including Submodules in ADO Pipelines

Git Submodule Quick Reference Guide

What Are Submodules?

A submodule is a link – it is a pointer at a specific path in the parent repository’s directory structure that references a specific commit in another git repository. The actual git repository is no different than any other repository – what makes it a ‘submodule’ is someone using it inside a parent repository. If I create a really cool PHP function – an Excel parsing utility – I can publish it to a git repository to share it. This repository is not a submodule for me. Someone who wants to use my utility may add my repository as a submodule in their code repo. In their repo, my repository that houses the Excel parsing utility is a submodule. You cannot tell, by looking at a repository, if it is included elsewhere as a submodule.

The example provided above – one individual looking to include someone else’s code in their project – is the typical use case for submodules. But there are other scenarios where isolating different parts of code make sense. In our organization, we cannot deploy code until it passes a security scan. This means problems in one section of the code can hold up development on everything else. Additionally, our code scanning is charged per repository. While we may have a dozen independent tools that happen to reside in a parent folder, it is not cost effective to have dozens of repositories being scanned. Using submodules will allow us to roll up independent tools into a single scanned repository.

The parent repository does not actually track the changes in a submodule. Instead, the parent contains a pointer to a commit identifier in the submodule repository. All changes in the submodule are tracked in its repository.

Before You Start

Before working with submodules, ensure you have an up to date version of git. If, as you interact with remote repositories, you see an error indicating you are using an older version of git:

You are apt to see fatal errors like this:

Recursion

You can have submodules inside of submodules inside of submodules – because of this, most command that interact with submodules have a –recursive flag that will recursively apply the command to any submodules contained within submodules within the parent repository.

Adding Submodules

Begin with a git repository that will serve as the parent – in this example, I am starting with a blank repo that only contains a .gitignore and a readme.md file, but you can start with a repository that’s been under development for years.

We link in submodules using “git submodule add” followed by the repo URL. If you want the folder to have a different name, use “git submodule add URLHERE folderNameForSubmodule”

Once the Tool1 submodule has been added, files from Tool1 are on disk.

Check out a branch in the submodule, or use “git submodule foreach” to check each submodule out to the desired branch (main in this case)

If you use git diff with the –submodule switch, you can see that “Tool1” is being tracked as a submodule.

Add and commit the change – since this is both an entry in the .gitmodules file and the submodule reference to a commit hash, I am using “git add *”

Then push the changes to ADO

Looking in ADO, you will see each submodule folder is represented as a file. The file contents is the hash of the commit to which the submodule points.

Because the “folder” is reflected as a file in the repository, it will not sort with the other folders. You’ll need to look in the file listing to find the submodule file.

Performing Operations on All Submodules

The “git submodule foreach” command allows you to run commands on each submodule in a repository.

You can run any git command in this fromeach loop – as an example, I will checkout a specific branch in all submodules using

git submodule foreach –recursive git checkout main

Cloning a Repository That Contains Submodules

Cloning with –recurse-submodules flag

When cloning a repository with submodules, add –recurse-submodules to your git clone command to pull all of the submodules. To clone my ParentRepo, I use:

git clone –recurse-submodules https://Windstream@dev.azure.com/e0082643/e0082643/_git/ParentRepo

You will see that git checks out the parent project and all included submodules.

Use “git submodule foreach –recursive git checkout main” to check each submodule out to the desired branch (in this case, main).

Cloning without –recurse-submodules

If you forget to include –recurse-submodules in your clone command, or if the submodules have been added to a repository that you have already cloned, you will have empty folders instead of each submodule’s files. A parent repository only has pointers to commits for submodules – until the submodules are initialized, those pointers go nowhere.

Use git submodule update –init –recursive to pull files from the registered submodules.

Making Changes to a Submodule

Working within a submodule’s folder will perform operations in the submodule’s git repository. Working in other folders will perform operations in the parent’s git repository.

To make a change to Tool3, I first need to change into the Tool3 directory.

Make whatever changes are needed. Then add the changes, create a commit, and push the commit to the submodule’s repository. You can create branches, merge branches, and such all within the submodule’s folder to interact with the submodule’s git repository.

If you look at the parent repository, you will see that no changes have been made to it – committing changes to a submodule repository will not kick off the pipeline code scan.

To update the parent repository to point the submodule to your most recent commit, you will need to commit the submodule folder in the parent folder. Change directory into the parent repo – add, commit, and push the changes.

And push the references to the remote using “git submodule update –recursive –remote”

This commit updates the contents of the file representing the repo’s folder – the folder’s file used to reference a commit ending in 6d95 and now it references a commit ending in 6292

Which you can see in the git log of the submodule repository:

Because we have made changes to the parent repository, the pipeline that initiates our code scanning should execute.

Pulling Changes Made to Submodules

To sync changes made by others (or that you’ve made in other locations), you will need to pull the submodules, you need to pull the submodule as well as the parent repo.

To pull (fetch and merge) changes from all upstream submodules, use:

git submodule update –recursive –remote –merge

Using “git submodule status” to view the submodules, you can see the submodule now points to the commit hash from the change we made

Including Submodules in ADO Pipelines

Veracode scanning is initiated through a pipeline. Since our goal is to include files from all of the submodules when scanning the parent repository, we need to ensure those files get bundled into the zip file that is submitted for scanning.

To do so, we need to add a checkout step to the pipeline and set “submodules” to ‘true’

When the submodules are all part of the same ADO project, you do not need to supply additional credentials in the pipeline. Where a different set of credentials are required, you can check out the submodule by passing an extra header with an authorization token.

Git Submodule Quick Reference Guide

Clone repo and all submodules:

git clone –recurse-submodules REPO_URL

cd RepoFolder

git submodule update –init –recursive

git submodule foreach –recursive git checkout main

Add a submodule to a repo:

git submodule add REPO_URL /path/to/folderForSubmodule

git submodule update –init –recursive

git submodule foreach –recursive git checkout main

git add .

git commit -m “Adding submodules”

git push

git submodule update –recursive –remote

Check Out a Branch in All Submodules

git submodule foreach –recursive git checkout main

Committing Change to a Submodule

cd .\submodule_folder

# Make some changes!

git add .

git commit –author=”Lisa Rushworth <Lisa.Rushworth@windstream.com>” -m “Commit Message”

git push

cd ..

git add submodule_folder

git commit -m “Updated submodule”

git push

Pull Updates into All Submodules:

git submodule update –recursive –remote –merge

 

NEO4J: Installing Plugins

Since I have the /plugins directory persisted, I can use wget <URL> to grab a JAR file and place it into the plugins folder. Restart the container and the plugin is active.

Example installing APOC extended plugin:

cd /docker/neo4j/plugins;wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/5.11.0/apoc-5.11.0-extended.jar;docker stop neo4j;docker start neo4j

NEO4J: APOC Plugin in Docker Container

I wanted to test some of the Awesome Procedures on Cypher (APOC) functions, but first I needed to install the plugins. I needed to modify my docker run statement slightly to persist the /plugins directory and install APOC core:

 

docker run -dit -p 7474:7474 -p 7687:7687 --volume=/docker/neo4j/data:/data --volume=/docker/neo4j/conf:/var/lib/neo4j/conf --volume=/docker/neo4j/plugins:/plugins -e NEO4J_AUTH=none -e NEO4J_apoc_export_file_enabled=true -e NEO4J_apoc_import_file_enabled=true -e NEO4J_apoc_import_file_use__neo4j__config=true -e NEO4J_PLUGINS=\[\"apoc\"\] --name neo4j neo4j