$Id: README 7606 2022-01-08 21:45:07Z flaterco $

    Harmbase 2:  harmonic constant management package.
    Copyright (C) 2004  David Flater.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.


THIS PACKAGE IS AVAILABLE FROM:
https://flaterco.com/xtide/files.html


Introduction
------------

Harmbase is a database application for managing harmonic constants.  I built
Harmbase to solve my own problems, so it's not "plug and play" software.
It's not designed to be useful to someone who does not know how to hack Ruby
code or write interactive SQL queries.


Prerequisites
-------------

Dependency (w/ version tested)   get it from

XTide 2.15.3                     https://flaterco.com/xtide/files.html
PostgreSQL 14.1                  http://www.postgresql.org/
libdstr 1.0                      https://flaterco.com/util/index.html
libtcd 2.2.7-r3                  https://flaterco.com/xtide/files.html
congen 1.7-r2                    https://flaterco.com/xtide/files.html
Ruby 2.2                         Your friendly neighborhood Linux distro
pg-1.2.3.gem                     https://rubygems.org/gems/pg/
json-2.5.1.gem                   https://rubygems.org/gems/json/

The command-line client tide (of XTide) must be in your PATH to run
fixup_datums and various test scripts.


What gets installed
-------------------

When you run make install, these two binaries get installed in $(prefix)/bin
(alias bindir):

  fixup_datums - estimate MLLW for stations that are missing it
  hbexport - dump the database to a TCD file

To avoid cluttering up bin, miscellaneous scripts are instead placed in
$(prefix)/share/harmbase2 (alias pkgdatadir) along with other dependencies.
This directory typically won't be in your PATH.

  harmbase2.sql - the harmbase2 schema
  scrape-MDAPI-0-stationlists.sh - download lists of tide and current stations
  scrape-MDAPI-1-tiderefs.rb - download harmonic constants and datums for tides
  scrape-MDAPI-2-geogroups.rb - download hierarchical geographic areas
  scrape-MDAPI-3-currefs.rb - download harmonic constants for currents
  scrape-currents-regions-lists.sh - download HTML lists of current stations
  importNOS.rb - import data from CO-OPS Metadata API (MDAPI)
    importNOS_CurRefs.rb     - part of importNOS.rb
    importNOS_CurSubs.rb     - part of importNOS.rb
    importNOS_DodgyCode.rb   - part of importNOS.rb
    importNOS_Geogroups.rb   - part of importNOS.rb
    importNOS_JsonReaders.rb - part of importNOS.rb
    importNOS_Queries.rb     - part of importNOS.rb
    importNOS_TideRefs.rb    - part of importNOS.rb
    importNOS_TideSubs.rb    - part of importNOS.rb
    importNOS_Util.rb        - part of importNOS.rb
  parse-oldcurlists.rb - get geographic areas from HTML current station lists
  compare_LST_problems.sql - queries for testing
  misc_tests.rb - miscellaneous tests for common import problems
  getPreds.rb - given station ID, download predictions from NOAA web service
  sample.rb - pull predictions from tide and getPreds.rb for comparison
  dump_db.sh - export database to SQL without extraneous bits
  boilerplate.txz - data for dwf-export.sh
  dwf-export.sh - produce distributable SQL and TCD tarballs
  loclist.rb - dump station list to web page
  Debian.sh - build TCD file from SQL "source."
  README - this file


Setting up the schema
---------------------

The file harmbase2.sql contains the schema with no data sets.

$ createdb harmbase2
$ psql < harmbase2.sql     # schema file contains \connect harmbase2

Alternatively, restore a database dump with data sets already in it.

$ psql harmbase2 < harmonics-dwf-*.sql


Importing
---------

As of the 2018 renovation, the import program works only with JSON files that
were downloaded from the CO-OPS Metadata API (MDAPI).  Please see "Details of
import process for NOS data" at the end of this README for the full process.

Usage: importNOS.rb N
N=1  tidal reference stations
N=2  tidal subordinate stations
N=3  current reference stations
N=4  current subordinate stations

Importing current stations may require additional steps to acquire geographic
area information that is not provided through MDAPI.

Always save a snapshot of the database BEFORE running import.  No attempt is
made to leave the database in a clean state when failures occur.

Harmonic constants produced by harmgen are imported directly using the
interactive SQL that harmgen generates.  The import program is not required
for that.


Deleting data sets
------------------

Referencing data in the table CONSTANTS will be deleted automatically via ON
DELETE CASCADE when a reference station is deleted from table DATA_SETS.
However, subordinate stations in the table DATA_SETS must be deleted before
the reference station.  (For safety, deletion does not cascade to sub
stations.)
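In other words, deletion is a two-step operation.  The sketch below is
illustrative only: the station ID is made up, and the key column names are
assumptions, not taken from the schema; consult harmbase2.sql for the real
ones.

```sql
-- Hypothetical station ID and column names; see harmbase2.sql.
-- 1. Remove the subordinate stations that point at the reference
--    station (deletion does not cascade to them).
DELETE FROM data_sets WHERE ref_station_id = 'ABC0001';
-- 2. Remove the reference station itself; its rows in constants
--    disappear via ON DELETE CASCADE.
DELETE FROM data_sets WHERE station_id = 'ABC0001';
```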


Exporting
---------

Usage: hbexport [--flip constituent-name] [--optimize] [--free] [--nonfree]
                [-b YYYY] [-e YYYY] filename

--flip is a guru switch for troubleshooting the definitions of harmonic
constants.  It reverses the phase of the specified constituent.

--optimize causes export to eliminate unused constituents from the TCD file.
Otherwise, all constituents known to the database will be exported.

--free limits the export to data that are marked public domain.  By default
you get all data.

--nonfree limits the export to data that are NOT marked public domain.

-b and -e are used to select beginning and ending years for the harmonics
file.  The defaults are 1700 and 2100 (centered around 1900, the year for
which the speeds of the constituents are calibrated).


Details of import process for NOS data (December 2021)
------------------------------------------------------

These programs, scripts, and processes break to some extent every year as
NOS updates its web site.  This description includes notes on what was
fixed along the way this time.

Save snapshots of the database after every step (dump_db.sh > snapshot.sql).
Revert to the previous snapshot whenever it is necessary to fix something and
rerun (dropdb harmbase2; createdb harmbase2; psql harmbase2 < snapshot.sql).

Review reminders left in 2020.

  Check MDAPI changelog to see if 1.0 remains current.
  https://api.tidesandcurrents.noaa.gov/mdapi/prod/
    Prod is still 1.0 (July 2018).

  Run test_currents_geogroups.rb.
    No change.

  Watch the effective date for the new tidal datum epoch.
  https://www.tidesandcurrents.noaa.gov/datum-updates/ntde/
  "The current proposed release date for new NTDE products is 2025."
  This affects data but also fixup_datums.
    Still targeting 2025.

  Revise scrape-MDAPI-2-geogroups.rb to detect and retry fetches that return
  binary garbage with an OK status.
    Revised to fetch into separate files but the problem did not reproduce.

  Moved assorted extra scripts of marginal value into hacks_unpublished.

Web scrape.

  scrape-currents-regions-lists.sh
  Run time less than 1 minute.
  This is still the only way to resolve geogroups for new currents.

  scrape-MDAPI-0-stationlists.sh
  Run time less than 1 minute.
  (Note:  The counts can be found on line 2 of the json files.)
    tides.json
      3280 tide stations with offsets for sub stations but metadata only for
      refs.  Caution, includes the Great Lakes.
    currents.json
      4362 current stations with offsets for sub stations but metadata only
      for refs.
    harcon.json
      1180 tide stations with links to get harmonic constants and datums.
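These counts can be sanity-checked mechanically.  A minimal Ruby sketch,
assuming each station list carries a top-level "count" field (as the note
about line 2 of the JSON files suggests):

```ruby
require 'json'

# Report the station count from an MDAPI station-list document.
# Assumes a top-level "count" field; adjust if the API changes.
def station_count(json_text)
  JSON.parse(json_text)['count']
end

# Synthetic document shaped like tides.json:
doc = '{ "count": 3280, "stations": [] }'
puts station_count(doc)   # => 3280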

  scrape-MDAPI-1-tiderefs.rb
  Run time over 6 hours.
    Read harcon.json and scrape harmonic constants and datums into separate
    directories and files.

  scrape-MDAPI-2-geogroups.rb
  Run time about half an hour.
    geogroups-tides.json
      Source of the multi-level headings in station listings.
    geogroups/NNNN/children.json
      Tide station metadata with pointers back to the relevant geogroup.

  Check for binary garbage in geogroups.  If found, improve script to detect
  and correct it automatically.
  grep -l -r -P "[\x80-\xFF]"
  No corrupt files found.

  scrape-MDAPI-3-currefs.rb
  Run time over 2 hours.
    Read currents.json and scrape harmonic constants as applicable.

Look for errors in web scrape.

  Check for HTTP errors.
  bash-4.3$ fgrep HTTP *refs/scrape_log.txt | fgrep -v "200 OK"
  19 instances of 504 Gateway Timeout (14 in tiderefs, 5 in currefs).
  fgrep '(try' *refs/scrape_log.txt
  Highest try = 6, in currefs.
  The default retry limit is 20.
  No other errors.

  Check for truncated sets of constants.
  In tiderefs:
  bash-4.3$ ls -alS */harcon.json | less
    Matapeake, Kent Island (8572770) is the shortest with 18 constituents.
    Anchorage (9455920) has 120 constituents; descriptions after 37 are null.
  In currefs:
  bash-4.3$ ls -alS */harcon.json* | less
  Oxy Oil and Gas CM (cc0401) was broken before.
  Port Aransas, Channel View (cb0301) now has the same disease.
    "HarmonicConstituents": []
  Galveston Channel, west end (g09010) is next shortest, 15 constituents but
  OK.

  bash-4.3$ grep -Flr '"HarmonicConstituents": []'
  currefs/cb0301/harcon.json?units=english
  currefs/cc0401/harcon.json?units=english

Compare with previous year's download.  Look for any big changes.

  diff -y ../NOS-20201223/geogroups-tides.json geogroups-tides.json
  New geogroups:
  1764 Kotzebue Sound
  1840 Matagorda Bay & San Antonio Bay
  1841 Aransas Bay
  1842 Corpus Christi Bay
  1840-1842 have null level, requiring a fix to importNOS_Geogroups.rb.
    Ran hacks_unpublished/change_geogroup_template.rb.  No fixups needed.

  Deleted geogroup:
  1704 Copper River Delta went away.
    TWC1867 Pete Dahl Slough, Copper River Delta, Alaska
    is now in Gulf of Alaska.
    Leave it.

  1731 St. Lawrence Island parentGeoGroupId changed from 1443 to 1391.
    1391 Alaska is parent of 1443 Bering Sea.
    No fixups needed.

  Some coordinates changed.
  Some offsets that were not null before are null now.
  Some datums got disclaimers saying they are provisional.

  A bunch of stations in tides.json that had state "LA" now have "".
  Its only use is in guessState for new tide subs.
  Other batches of changes, reference_id nulled out, types changed.

  In currefs, every constituent now has majorPhaseGMT and minorPhaseGMT.

  New currefs:
  NYH1902 NYH1933 STX1814 STX1821 STX1822 cb1401 cc0601 hb0302 n05010 sn0801
  None of these bring subordinate stations with them.
  s05010 went away.

  oldcurlists:  minor updates.

  No massive changes.

Prepare database.

  Upgrade PostgreSQL; install or update pg and json gems for Ruby.
    Stuck on json-2.5.1 because json-2.6.1 requires Ruby version >= 2.3.

  Reload database from last year.
    bash-4.3$ createdb harmbase2
    bash-4.3$ psql harmbase2 < harmonics-dwf-20210110.sql

  Replace data_sets_old with the previous data.
    DROP TABLE data_sets_old;
    CREATE TABLE data_sets_old AS TABLE data_sets;

  Purge old data.
    delete from data_sets;
    -- drop table currents_geogroups;
    alter table currents_geogroups rename to currents_geogroups_old;

Process oldcurlists to populate currents_geogroups table.  (This table, which
is not in the schema, was created as a temporary workaround that is
increasingly permanent.  See the commented-out block at the end of
scrape-MDAPI-2-geogroups.rb.)

  parse-oldcurlists.rb oldcurlists/*

  Review changes.
  harmbase2=# copy currents_geogroups to '/tmp/currents_geogroups_new.txt';
  harmbase2=# copy currents_geogroups_old to '/tmp/currents_geogroups_old.txt';
  bash-4.3$ diff currents_geogroups_old.txt currents_geogroups_new.txt

  < ACT3831	New York	New York Harbor	Lower Bay
  < ACT3836	New York	New York Harbor	Lower Bay
  > NYH1902	New York	New York Harbor	Lower Bay

  < ACT3641	New York	New York Harbor	Upper Bay
  > n05010	New York	New York Harbor	Upper Bay

  < ACT3881	New Jersey	\N	\N
  > NYH1933	New Jersey	\N	\N

  > cb1401	Virginia	Hampton Roads	Newport News
  > sn0801	Texas	Sabine Pass	\N
  > g10010	Texas	Galveston Bay	\N
  > cc0601	Texas	\N	\N
  > STX1814	Texas	Laguna Madre	\N
  > STX1821	Texas	\N	\N
  > STX1822	Texas	\N	\N
  < s05010	California	\N	\N
  > hb0302	California	Humboldt Bay	\N

  Consistent with minor updates.  No geogroup upheaval.
  drop table currents_geogroups_old;

Make one-time fixups to old metadata before they carry over.

  Mitigate a few glaring examples of surprising station names resulting from
  the state line running between the named landmark and the supposed location
  of data collection.  None of these have changed coordinates in the latest
  json.

  1. Fort Pulaski is in GA, but these stations are just over the line in SC.
  8670870     Fort Pulaski, Savannah River entrance, South Carolina
  ACT7356_1   Fort Pulaski, Savannah River, South Carolina Current
  --> Fort Pulaski, N of, Savannah River, South Carolina [Current]

  2. The coordinates of ACT6266_1 Virginia Beach, south end are in a
  beachfront lot in North Carolina, miles away from the center of Virginia
  Beach.
  --> Carova Beach, north end, North Carolina Current

  3. Atlantic Heights is a cul-de-sac in NH.
  8423635     Atlantic Heights, Piscataqua River, Maine
  --> Atlantic Heights, E of, Piscataqua River, Maine

  Fix wrong states at TX/LA border (Sabine Pass/Lake/River and offshore).
  Move 8771081, sn0301_5, sn0301_17, and sn0301_30 from TX to LA.
  8771081     Sabine (offshore), Texas
  sn0301 (3 bins)
    Sabine Front Range, Point Hunt, NW of (...), Sabine Pass, Texas Current

  Checkpoint.
    dump_db.sh > database/00-start.sql

Import tide refs.

  importNOS.rb 1 > import1-log.txt

  Diff vs. last year:
  Batch of new stations in TX, VI, and PR.
  LST problems are unchanged (Alaska and Florida).

  Ensure that timezones as assigned by import haven't changed.
  psql harmbase2 -f compare_LST_problems.sql -o compare_LST_problems.txt
  24 this year and last.
  0 rows from third query = OK.

  Checkpoint.
    dump_db.sh > database/01-refs-imported.sql

Edit and run renamings-tiderefs.sql to fix up names of new stations.

  psql harmbase2 < renamings-tiderefs.sql

Post-renamings fixes

  (Geogroups comment printout was modified after the run.)

  Nilled out geogroup 1434 Gulf of Alaska in L6rewrite.

  Added patterns to standardize Vieques Island and Isla de Vieques to just
  Vieques and applied them to data_sets and data_sets_old.

  Checkpoint.
    dump_db.sh > database/02-refs-renamed.sql
    hbexport 02-refs-renamed.tcd

Install estimates for missing datums.

  The NOS import will set the datum to Mean Astronomical Tide (i.e., zero) on
  stations where the MLLW benchmark is unavailable.  When this happens,
  export a database, then run fixup_datums.

  fixup_datums will invoke tide in stats mode to estimate MLLW for each
  affected station over the 19-year epoch 1983-2001 and update the database
  accordingly.

  script -c "fixup_datums 02-refs-renamed.tcd"

  Note that changing meridians can affect the estimated MLLW.
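The reduction fixup_datums performs amounts to averaging the lower of each
day's low waters over the epoch.  A toy sketch of that computation with
synthetic numbers (the real program obtains its low waters by running tide
in stats mode):

```ruby
# Estimate MLLW as the mean of each day's lower low water.
# lows maps a date string to that day's low-water heights.
def estimate_mllw(lows)
  daily_lower = lows.values.map(&:min)
  daily_lower.inject(:+) / daily_lower.size.to_f
end

# Synthetic mixed-semidiurnal data: two lows per day.
lows = {
  '1983-01-01' => [0.4, -0.2],
  '1983-01-02' => [0.3, -0.1],
}
puts estimate_mllw(lows)   # mean of -0.2 and -0.1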

  Checkpoint.
    dump_db.sh > database/03-fixed-datums.sql

Import tide subs.

  importNOS.rb 2 > import2-log.txt

  Diff from last year:
  - Name fixup noise went away
  - Blacklisted station TEC4773 went away

  Unchanged:
  bash-4.3$ fgrep "Time warp" import2-log.txt
  Time warp at 8728912 Port Saint Joe, St. Joseph Bay, Florida
  Time warp at 9462144 North side, Yunaska Island, Alaska
  Time warp at 9462161 East Cove, Yunaska Island, Alaska
  Time warp at 9462239 Applegate Cove, Chuginadak Island, Aleutian Islands, Alaska
  Time warp at 9468123 Fossil River entrance, St. Lawrence Island, Alaska
  Time warp at 9468216 Niyrakpak Lagoon entrance, St. Lawrence Island, Alaska
  Time warp at TWC2223 Herbert Island, west side, Aleutian Islands, Alaska
  Time warp at TWC2229 Amukta Island, north side, Aleutian Islands, Alaska

  Unchanged:
  Fixing longitude of TWC2321 Etienne Bay

  Skipped from blacklist:  112 (TEC4773 went away)
  Skipped from missing refs:  9 (no change)
  Missing refs are (no change):
  1778000 APIA, W. SAMOA (Samoa, dropped by import)
  1841275 MALAKAL HARBOR (Palau, allowable but missing)

Edit and run renamings-tidesubs.sql to fix up names of new stations.

  Empty file.

  Checkpoint.
    dump_db.sh > database/04-tidesubs-imported.sql

Import current refs.

  importNOS.rb 3 > import3-log.txt

  import3-log.txt:
  ** Possible wrong state, NYH1902 40.475799560546875 -74.04969787597656 Chapel Hill Channel, South End
     Geogroup state NY
     Guessed NJ (d=2e-04)
  ** Possible wrong state, n05010 40.67210006713867 -74.03990173339844 Gowanus Flats LBB 32
     Geogroup state NY
     Guessed NJ (d=6e-05)
  ** cb0301 harcon is null; skipping it.
  ** cc0401 harcon is null; skipping it.

  Check for unintended name changes.
  psql harmbase2 < changednames.sql
  At 5 stations the name of the shallowest depth was harmonized with the
  deeper depths.

Edit and run renamings-currefs.sql to fix up names of new stations.

  State lines still being a pain.

  Checkpoint.
    dump_db.sh > database/05-currefs-imported.sql

Import current subs.

  importNOS.rb 4 > import4-log.txt

  Diff from last year:
  - ACT3846_1 and ACT8831_1 now have their reference stations
  - New skipped stations PCT1391_1 and PCT1411_1 were added after the fact
    last year.

  18 time warps unchanged.
    All are AK merid -10 pointing at merid -9.
  16 non-US stations skipped, effectively unchanged.
  2 null multipliers (PCT4416_1, ACT2771_1), unchanged.
  1 station still lacks its reference
    ** Reference station PCT2316_1 for sub PCT2311_1 does not exist

  Check for unintended name changes.
  psql harmbase2 < changednames.sql
  Nothing new.

Edit and run renamings-cursubs.sql to fix up names of new stations.

  psql harmbase2 < renamings-cursubs.sql

  Checkpoint.
    dump_db.sh > database/06-cursubs-imported.sql

Misc tests.

  Run misc_tests.rb and either fix and rerun import or make manual fixes as
  appropriate.

  Checking for duplicate names: pass
  Checking for stations missing states: pass
  Checking for stations with mismatched state names: pass
  Checking for stations that switched hemispheres: pass
  Checking for Pacific islands in the wrong hemisphere: pass
  Checking time zones on St. Lawrence Island: pass
  Checking time zones for Port Saint Joe and White City, FL: pass
  Checking for any subs referring to Port Saint Joe and White City, FL: pass
  Checking for any subs referring to refs with LST problems: pass
  Check for stations that disagree on state: pass
  Check for stations that disagree on coordinates: FAIL

  Lots of churn in coordinates, mostly inconsequential.

  Big mover:  8571072 Ape Hole Creek, Pocomoke Sound, Chesapeake Bay, Maryland
  < 8571072  37.9617  -75.8217    Plausible
  > 8571072  44.7161  -66.5125    Wrong
  The bad data originated from tides.json.
  Added erratum (new errata.sql).
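The coordinate check amounts to a distance threshold between last year's
and this year's positions; the 8571072 case clears any reasonable cutoff.
A sketch (the threshold and the second station are made up; the real test
lives in misc_tests.rb):

```ruby
# Flag stations whose position moved more than `threshold` degrees
# (crude flat-earth distance, adequate for catching gross errors).
def big_movers(old_coords, new_coords, threshold = 0.5)
  old_coords.keys.select do |id|
    lat0, lon0 = old_coords[id]
    lat1, lon1 = new_coords.fetch(id, old_coords[id])
    Math.hypot(lat1 - lat0, lon1 - lon0) > threshold
  end
end

old = { '8571072' => [37.9617, -75.8217],   # from the diff above
        'TEST001' => [10.0, 20.0] }         # hypothetical, unchanged
new = { '8571072' => [44.7161, -66.5125],
        'TEST001' => [10.0, 20.0] }
puts big_movers(old, new)   # prints 8571072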

  Count stations and make sure the numbers look reasonable.

  Checkpoint.
    dump_db.sh > database/07-fixed-erratum.sql
    hbexport 07.tcd

  Look at the map and check for unlikely station coordinates.

Coverage tests.

  (Unpublished script coverage.sh)
  Test at least one station of each type:
    Tide ref
    Current ref
    Tide sub, * offsets
    Tide sub, + offsets
    Current sub (* offsets only)
    Tide ref wrong LST (UTC matches, LST/LDT doesn't)
    Current ref wrong LST--don't have any
    Tide sub time warp (LST/LDT matches, UTC doesn't)
    Current sub time warp

  Passed.

  A difference of 19 minutes on a high tide was traced to a weak lower high
  water at 1612340 Honolulu Harbor.  The flatter the tide, the bigger the
  difference between XTide and the NOS web site.
  xtide  2022-01-10 19:50 UTC   1.00 feet  High Tide
  xtide  2022-01-11 20:50 UTC   0.79 feet  High Tide
  NOS    2022-01-10 20:04   0.995   H
  NOS    2022-01-11 21:09   0.791   H
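That sensitivity is easy to quantify: near a high tide the curve is roughly
h(t) = h0 - a*(t - t0)**2, so a height discrepancy eps between two
prediction engines moves the predicted time of the extremum by about
sqrt(eps/a), which grows without bound as the curvature a shrinks.  A toy
Ruby illustration (the numbers are illustrative, not Honolulu's):

```ruby
# Time shift caused by a height discrepancy eps near a parabolic
# extremum h(t) = h0 - a*(t - t0)**2.
def time_shift(eps, a)
  Math.sqrt(eps / a)
end

eps   = 0.01                    # feet of disagreement in height
sharp = time_shift(eps, 1.0)    # steep tide curve
flat  = time_shift(eps, 0.01)   # weak, flat high water
puts "sharp: #{sharp} h, flat: #{flat} h"   # flat shifts 10x more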

Random tests.

  Check random samples using sample.rb until exhausted.  Note that the
  predictions retrieved for comparison with a (sub) station will likely be
  from a corresponding harmonic station and therefore won't match very well.
  The opposite situation also happens, where harmonic constants were
  downloaded but the web form hits a subordinate station.

Fix name glitches discovered during testing.

  Port Protection, Prince of Wales Island., Alaska
  Not new, but the typo is still there in tides.json.

  9464212   Village Cove, St. Paul Island, Bering Sea, Alaska
    --> Village Cove, St. Paul Island, Pribilof Islands, Alaska
  for consistency with
    Walrus Island, 0.5 mi west of, Pribilof Islands, Alaska Current
    SW Point, St. Paul Island, 1 mi off, Pribilof Islands, Alaska Current

  St. Mathew I., southwest coast, Alaska Current
    --> St. Matthew Island, southwest coast, Bering Sea, Alaska Current
  for consistency with
    St. Matthew Island, Bering Sea, Alaska

  (Filed as postscript-renamings.sql)

  Checkpoint.
    dump_db.sh > database/08-final-fixes.sql

Check for reminders, loose ends, TO-DOs, and FIXMEs.

Finish.

  Update ChangeLog in boilerplate.txz and run dwf-export.sh.

  Upload.

  Update station list on web using loclist.rb.

  Update files.html and news.html.


Reminders for December 2022
---------------------------

Add "possibly wrong state" warnings to renamings from tide refs and subs
similar to currents.

Consider automating something to track geogroups revisions.

Vague geogroup 1840, Matagorda Bay & San Antonio Bay, has null level.  If it
becomes l6, nil it out in L6rewrite.

Add step to check and apply errata if still applicable.
